Finding Pearls: Psychometric
Reevaluation of the
STEVEN V. OWEN
Department of Epidemiology and Biostatistics, The University of Texas Health Science
Center at San Antonio, San Antonio, TX 78229-3900, USA
MARY ANNE TOEPPERWEIN, CAROLYN E. MARSHALL
Barshop Institute for Longevity and Aging Studies and Frederic C. Bartter General
Clinical Research Center, The University of Texas Health Science Center at San Antonio,
San Antonio, TX 78229-3900, USA
MICHAEL J. LICHTENSTEIN
Barshop Institute for Longevity and Aging Studies; Division of Geriatrics and
Gerontology, Department of Medicine; and Frederic C. Bartter General Clinical Research
Center, The University of Texas Health Science Center at San Antonio, San Antonio, TX
CHERYL L. BLALOCK, YAN LIU, LINDA A. PRUSKI, KANDI GRIMES
Barshop Institute for Longevity and Aging Studies and Frederic C. Bartter General
Clinical Research Center, The University of Texas Health Science Center at San Antonio,
San Antonio, TX 78229-3900, USA
Received 24 October 2007; revised 27 March 2008, 9 May 2008; accepted 19 May 2008
Published online 7 October 2008 in Wiley InterScience (www.interscience.wiley.com).
ABSTRACT: The Simpson–Troost Attitude Questionnaire (STAQ) was developed as part
of a study to assess adolescent commitment to and achievement in science. For this psycho-
metric reappraisal of the 57-item STAQ, data were analyzed from a convenience sample
of 1,754 secondary students. Conﬁrmatory and exploratory factor analyses were applied,
and results suggested that the STAQ can be shortened from 57 to 22 items spanning ﬁve
Correspondence to: Mary Anne Toepperwein; e-mail: Toepperwein@uthscsa.edu
The contents of this report are solely the responsibility of the authors and do not necessarily represent
the ofﬁcial views of the National Center for Research Resources, National Institute on Aging, National
Heart, Lung, and Blood Institute, or National Institutes of Health.
2008 Wiley Periodicals, Inc.
REAPPRAISAL OF THE SIMPSON–TROOST 1077
revised dimensions. These ﬁve dimensions were arranged into an exploratory structural
equation model, which showed some grade and gender differences and strong associations
among classroom environment, self-directed effort, and science affect. Findings raise the
potential for teacher professional development to improve science classroom activities, and
influence student self-directed effort and science affect. © 2008 Wiley Periodicals, Inc. Sci
Ed 92:1076–1095, 2008
From a global perspective, scientiﬁc and technological development is occurring at a
rapid pace in many countries, while in the United States such development is slowing
and perhaps even eroding (National Academy of Sciences, 2006). In their report Rising
Above the Gathering Storm: Energizing and Employing America for a Brighter Economic
Future, the National Academy of Sciences issued their prioritized recommendations for
preserving the “science and technology enterprise so that the United States can successfully
compete, prosper, and be secure in the global community” (p. ES-1). Signiﬁcantly, the top
priority was substantive improvement of K-12 science and mathematics education in the
United States. With global scientiﬁc and technological growth occurring rapidly, declining
student interest in science courses and careers is a worldwide concern that has prompted
science education reform efforts on an international scale (e.g., InterAcademy Panel on
International Issues, 2008; National Science Teachers Association, 1996). Since student
attitudes toward science affect course and career choices, measuring the impact of reform
efforts on student attitudes is important (e.g., Koballa, 1988; Rennie & Punch, 1991; Wyer,
2003) and will require measurement tools with robust psychometric properties. Reviews
of science attitude instruments have found that few have sufﬁcient psychometric data and
called for more rigorous analyses of existing tools (Blalock et al., 2008; P. L. Gardner, 1975;
Munby, 1980; Osborne, Simon, & Collins, 2003). Applying modern psychometric analyses
to historically accepted and used instruments can result in shorter, more compact scales
with stronger psychometric properties, thus setting the stage for further study (Marshall et
al., 2007; Owen et al., 2007). Alternately, such an approach may reveal that a scale does
not have sufﬁcient psychometric properties to warrant recommendation for continued use
(Blalock et al., 2008; Lichtenstein et al., 2008).
In the 1970s and 1980s, signiﬁcant work was done internationally to develop instruments
that measure attitudes toward science (Fraser, 1977; C. Gardner, 1975; Khalili, 1987; Lin &
Crawley, 1987; Osborne et al., 2003; Schibeci & McGraw, 1981; Simpson, Koballa, Oliver,
& Crawley, 1994). However, statistical and psychometric methods have improved sharply
since that time. This improved methodology creates a timely opportunity to reevaluate
the factor structure and reliability of existing scales. With this goal in mind, and because
the Simpson–Troost Attitude Questionnaire (STAQ) continues to be used in contemporary
science education research, we sought to reevaluate its psychometric properties.
Development and Use of the STAQ
Concerns about scientiﬁc literacy and the loss of American leadership in science and
technology were major issues in the early 1980s. It was in this climate that Simpson
and Troost (1982) developed the STAQ as the centerpiece of a longitudinal, multimethod,
and multidimensional study to examine adolescent commitment to and achievement in
science. Simpson and Troost’s study was funded by the National Science Foundation,
spanned a decade, and aimed to provide longitudinal data for a model that could effectively
predict how students develop lifelong learning habits related to science. Their plan was to
1078 OWEN ET AL.
use this information to inform science education policy in the United States (Simpson &
Troost, 1982; Simpson & Oliver, 1990).
Simpson and Troost (1982) described the theoretical framework for their multistage study.
One of the stages involved creation of instruments that would, over a number of years, be
given to students and teachers to provide baseline information and to assess changes. The
overall study was designed to determine whether affective factors related to self, home and
family, and school and classroom, inﬂuenced student commitment to and achievement in
science. The Simpson and Troost deﬁnition of commitment to science included “attitudes,
interests, values, and other affective behaviors identiﬁed in the study” (p. 765). Existing
instruments that might be used to measure the variables in the study were evaluated and
deemed insufﬁcient because of “readability, item complexity, and low internal consistency”
(p. 771). Therefore, a major initiative of their study became the development of instruments
that could measure relevant variables.
The STAQ was developed and pilot tested with secondary students. Results from several
factor analyses created a ﬁnal pool of 58 Likert-type items comprising 14 subscales (Oliver
& Simpson, 1988; Simpson & Troost, 1982). Simpson and Troost (1982) used factor
analysis to conﬁrm that factors could be replicated. Content validity was evaluated by an
expert panel and item analysis, although no data were provided in their 1982 publication.
Exploratory factor analysis was mentioned, but no detail was provided, and conﬁrmatory
factor analysis was not used. In fairness, conﬁrmatory factor analysis was just emerging
in psychometric research at the time the STAQ was developed, so it is not surprising that
factor analysis results are incomplete.
Reliability and Validity Evidence for the STAQ
Our comprehensive review of studies using the STAQ is summarized in Appendices A
(peer-reviewed literature) and B (non-peer-reviewed literature, e.g., dissertations), both of
which are available at http://teachhealthk-12.uthscsa.edu/STAQ-App.pdf. In their validation
of the STAQ, Simpson and Troost (1982) modiﬁed the initial instrument based upon factor
analysis of data from 4,508 students. In the original study, no reliability estimates were
reported for the 14 subscales. Later peer-reviewed studies provided insight into reliability
evidence by reporting subscale KR20 values or Cronbach’s alpha with coefﬁcients ranging
from .33 to .95 (Atwater, Wiggins, & Gardner, 1995; Cannon & Simpson, 1985; Hill,
Atwater, & Wiggins, 1995; Oliver & Simpson, 1988; Simpson & Oliver, 1985, 1990; Talton
& Simpson, 1985, 1986, 1987). Dissertation studies using the STAQ reported Cronbach’s
alpha values for the subscales ranging from .27 to .95 (Cannon, 1983; C. Gardner, 1992;
Maidon, 2001; Oliver, 1986; Spellman & Oliver, 2001). Information provided by Dr. Oliver
(personal communication, September 30, 2004) reported subscale reliability coefﬁcients
ranging from .35 to .90 on ten subscales; information on four subscales was not included.
By today’s standards, the reliability evidence indicates that some subscales do not provide
a dependable measure of their construct.
Content validity information about the STAQ is limited, including the use of judgments
from expert panels (Cannon, 1983; Simpson & Troost, 1982) and professional test writers
(Cannon, 1983). With modern psychometric methods available, a more quantitative index
of content validity would help researchers evaluate the value of the STAQ as a research
tool (Polit, Beck, & Owen, 2007, in press). Exploratory factor analysis was mentioned
(Greenﬁeld, 1997; Simpson & Oliver, 1985; Simpson & Troost, 1982), but neither summary
data nor full explanations of procedures were provided. No published study has reported
conﬁrmatory factor analysis on the STAQ. In short, scant psychometric data have been
reported since Simpson and Troost’s (1982) original study. It is important to note that, with
the exception of Greenﬁeld (1997), all studies reviewed here used the original Simpson and
Troost (1982) data set.
In summary, STAQ content validation was done by a team of expert judges and teachers,
reports of internal consistency have been highly variable, item analysis was undeﬁned,
and factor analysis results were poorly described. However, because of the theoretical
underpinnings, large data set, and longitudinal design, Simpson and Troost (1982) set the
stage for a long history of research and validation. Our aim in this paper was to reassess the
STAQ with a new and large data set. Using current statistical techniques, we reevaluated
the psychometric properties and internal structure of the STAQ, and then applied structural
equation modeling to explore how STAQ variables interact.
The instrument, obtained from Drs. Simpson and Oliver, consisted of 59 items with
ﬁve-point Likert-type responses, plus two supplementary biographic items that we did not
use (living arrangements of the respondent and whether the respondent was right-handed
or left-handed). In addition, one item was duplicated, which according to Dr. Simpson was
a typing error, so we deleted one of these items. One school district requested that one
sensitive item be removed (I argue a lot with my family). Thus in the current study, the
administered instrument consisted of 57 items. Permission to use the instrument was given
by Dr. Simpson (personal communication, March 18, 2004). The project was approved by
the sponsoring university’s institutional review board.
The STAQ was administered to a geographically constrained convenience sample of
1,812 secondary students from 23 teachers in 8 schools, and within 6 districts in the Bexar
County (Texas) area. All but one school are majority-minority schools, i.e., schools whose
students are mostly from racial or ethnic backgrounds (African American, Hispanic, and
Native American) that are traditionally underrepresented in science in the United States.
We recognize that the unique population and sample in South Texas may limit the external
validity and generalization of our ﬁndings. The instruments were collected at the beginning
of the 2004–2005 school year, and the distribution of boys and girls from each grade level
is shown in Table 1. Reﬂecting the population of teachers participating in our program, the
preponderance of our sample of students was seventh graders (about 12–13 years old).
Lengthy scales that span several pages are bound to have some missing item responses.
Initial inspection of the data showed an obvious increase in missing data after item 37. That
happened to be the division between the second and third pages of the survey. We decided
that the purposeful omission of the last 20 items could not be regarded as missing-at-random
data, and those 58 cases were removed from all remaining analyses. Of the remaining 1,754
students, 65.3% responded to all questions on the STAQ, 26.9% were missing 1–2 items,
and 7.8% were missing between 3 and 21 items.
Additional preliminary analyses addressed whether data were missing at random. Later
analyses focused on studying the internal structure (i.e., dimensionality) of the STAQ.
Subsequently, we estimated subscale reliabilities, tested a causal model among STAQ
dimensions, and evaluated whether the model worked equally well for boys and girls and
across grades 6–8. A repeatability test was conducted using the revised 22-item instrument.
Table 1. Distribution of Boys and Girls (Number and Column Percent) Completing

Grade      Boys N (%)    Girls N (%)    Total N (%)    Missing
6          172 (9.8)     196 (11.2)     370 (21.1)     2
7          537 (30.6)    602 (34.3)     1141 (65.1)    2
8          112 (6.4)     128 (7.3)      240 (13.7)     0
Missing    0             0              0              3
Total      821 (46.8)    926 (52.8)     1754 (100)     7

Gender and/or grade level were missing for seven students.
grade level were missing for three students.
AMOS 7.0 was used for conﬁrmatory factor models and the structural equation modeling
(SEM). SPSS 15.0 was used for all other analyses.
Our data were multilevel in nature, with students nested in classes, and classes nested
in schools. Such multilevel data often violate the statistical assumption of uncorrelated
errors, which in turn raises the alpha level (i.e., the Type I error rate). The usual approach
to nested data is to use specialized multilevel analyses. The multilevel correction to alpha
levels, however, depends on the intraclass correlation (ICC). With an ICC of .00, there is
no advantage to a multilevel model. We computed ICCs for each of the composite variables
created during later analyses. They were all very small (mean ICC =.005), suggesting that
the added complexity of a multilevel structural model would not have signiﬁcantly altered
Preliminary Study of Missing Data
After removing those 58 respondents who did not answer the ﬁnal page, we studied
whether the remaining data were missing at random. The data set was partitioned into
two groups: those missing one or more responses on any of the 57 items and those with
complete data. Next, a discriminant function analysis was arranged to predict missingness.
Predictor variables were the only other two variables in the data set (grade level and
gender). The discriminant model was significant (p = .01). However, Wilks' lambda was
0.991 (that is, the effect size was R² = 1 − 0.991 = .009), which we judged trivial. Inspection of the
predictor means conﬁrmed this: For the missing group, the average grade level was 6.88;
for the nonmissing group, 6.95. And for the missing group, the average gender value
(boys =0, girls =1) was 0.64; for the nonmissing group, 0.56. The essential randomness
of missing data was important for later analyses, because different analyses treat missing
data differently. Exploratory factor analyses and reliability estimations, for example, use
casewise deletion, where every case with any missing data is automatically discarded.
By contrast, conﬁrmatory factor analysis and structural equation models operate with full
information maximum likelihood, which estimates the covariance matrix with all available
data, and no cases are thrown out. If data are missing at random, then these two approaches
to missing data should not produce results speciﬁc to their sample.
Dimensionality Evidence for the STAQ
For an initial test of the full Simpson–Troost 14-factor model, a conﬁrmatory factor
analysis (CFA) was arranged with the complete data set. The model was not supported,
with χ² = 4,986.8, df = 1,234, p < .0001 (should be nonsignificant; Hu & Bentler, 1995)
and a substandard comparative fit index (CFI) = 0.86 (should be at least 0.90; Hu & Bentler,
1995). The root mean square error of approximation (RMSEA), however, was acceptable
at 0.040 (0.90 CI = 0.039–0.041; p = 1.0; should be less than .08 and nonsignificant; Hu
& Bentler, 1995).
We then moved to an exploratory mode, splitting the overall sample into two approximate
random halves. One sample was named validation (n=848) and the other, cross-validation
(n=906). With the validation data, we staged a series of principal axis factor models with
oblique (direct oblimin) rotation, beginning with a 13-factor model, and descending to a
1-factor model. (The rationale for beginning with the 13-factor model was that the 14-factor
model had already been disconﬁrmed.) We used a loading criterion of 0.35 to identify
salient items. Several of the models contained uninterpretable factors whose content was
not coherent. The research team, seven members who have been studying student attitudes
toward science for the past 2 years, found the 10-factor solution, shown in Table 2, most
interpretable. All subscales had at least three items, and there was minimal cross-loading
on factors. One of the 10 factors was essentially nil (i.e., all item loadings were less than
0.35). We also reviewed and ultimately deleted item 21, which appeared markedly different
from its companion items.
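One way to obtain two approximate random halves is to assign each case independently with probability .5, which may be why the two halves differ slightly in size (848 and 906). The sketch below only illustrates that idea. The function name, the seed, and the case IDs are hypothetical, and the original split was not necessarily produced this way.

```python
import random

def split_validation(case_ids, p_validation=0.5, seed=2008):
    """Assign each case independently to the validation half with
    probability p_validation; the rest form the cross-validation half."""
    rng = random.Random(seed)  # fixed seed (hypothetical) so the split is repeatable
    validation, cross_validation = [], []
    for case_id in case_ids:
        if rng.random() < p_validation:
            validation.append(case_id)
        else:
            cross_validation.append(case_id)
    return validation, cross_validation

# 1,754 retained students; the integer IDs are placeholders, not real data.
case_ids = list(range(1, 1755))
validation, cross_validation = split_validation(case_ids)
print(len(validation), len(cross_validation))
```

Because assignment is case by case, the two halves are disjoint and together cover the whole sample, but their sizes are only approximately equal.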
We then applied a reliability test to each of the dimensions: Any subscale whose coefficient
alpha was below .70 would be eliminated from further consideration. This criterion
was based on the minimum value suggested for research that compares groups (Nunnally &
Bernstein, 1994). This is a much more stringent standard than used in earlier publications
investigating the STAQ, which reported subscale reliabilities as low as the .30s. However,
we argue that measurement error worms its way through statistical analyses—especially
multivariate methods—and poor measurements in turn weaken a study’s conclusions and
recommendations. By using the item loading criterion of 0.35 and setting a minimum alpha
coefﬁcient of .70 for each subscale, ﬁve revised factors were retained from the exploratory
factor analysis (Factors 1, 2, 3, 4, and 7 in Table 2). We labeled these subscales “Motivating
Science Class” (Factor 1), “Self-Directed Effort” (Factor 2), “Family Models” (Factor 3),
“Science Is Fun for Me” (Factor 4), and “Peer Models” (Factor 7) as shown in Table 3.
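The retention rule can be illustrated with a minimal sketch of coefficient alpha, computed as k/(k − 1) × (1 − Σ item variances / variance of total scores) on complete cases. The function names and the toy responses are hypothetical. Pairing the alpha values printed in Table 2 with Factors 1 through 9 in order is an assumption, although it reproduces the five factors the text reports as retained.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Coefficient alpha for complete cases.
    rows[i][j] is the response of person i to item j."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least two items and two respondents")
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])  # must be nonzero
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def retained_factors(alphas, minimum=0.70):
    """Return the factor labels whose alpha meets the minimum criterion."""
    return [factor for factor, a in sorted(alphas.items()) if a >= minimum]

# Hypothetical five-point responses from five students to a four-item subscale.
toy = [[4, 5, 4, 4], [2, 2, 3, 2], [5, 4, 5, 5], [3, 3, 3, 4], [1, 2, 1, 2]]
print(round(cronbach_alpha(toy), 2))

# Alpha values as printed in Table 2; the factor numbering is assumed.
table2_alphas = {1: .75, 2: .72, 3: .71, 4: .84, 5: .60,
                 6: .62, 7: .71, 8: .52, 9: .50}
print(retained_factors(table2_alphas))
```

Under the assumed numbering, the .70 criterion keeps Factors 1, 2, 3, 4, and 7, which agrees with the subscales named above.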
Next, we used the cross-validation sample (n=906) for a more stringent conﬁrmatory
factor analysis of the remaining five dimensions. Except for the exact fit test, χ²(203) =
853.9, p < .001, the remaining fit indices were good, with CFI = 0.94 and RMSEA = 0.043
(0.90 CI = 0.040–0.046). Table 3 gives the standardized loadings for this confirmatory
model. Figure 1 is a visual synopsis of how items on the original STAQ were either dropped
or rearranged to coalesce into the revised shortened structure presented in this paper. Within
the larger original STAQ, we were able to deﬁne a more concise, cohesive set of internally
consistent constructs (“the pearls”) that may be more useful for evaluating curricular and
pedagogical interventions in the classroom.
To explore grade level and gender differences on the revised STAQ dimensions, a
2 (gender) ×3 (grade) MANOVA was arranged with the cross-validation data set (Table 4).
The ﬁve dependent variables were scale scores of the revised STAQ dimensions, created by
averaging across the relevant items. The main effect of gender was significant (F(5, 890) = 6.99,
p < .01), but a univariate probe of the effect showed that boys and girls differed only on the
“Self-Directed Effort” subscale (respective means 4.04 and 4.28, p < .01, with a small effect
size, η² = .03). The main effect of grade was also significant (F(10, 1780) = 10.25, p < .01).
Table 2. Ten-Factor Pattern Matrix for the STAQ from Exploratory Factor Analysis of
the Validation Set (

Factor   1      2      3      4      5      6      7      8      9      10
Alpha    .75    .72    .71    .84    .60    .62    .71    .52    .50
SSL      12.25  2.77   2.28   1.27   1.08   0.76   0.64   0.61   0.53   0.40

Notes. Decimals omitted, loadings below 0.35 omitted. SSL = Sum of squared loadings.
Bold numbers indicate items or factors that met the criteria of having item loadings of 0.35
or better and alpha coefficients of .70 or better.
Item deleted because of dual loading.
Table 3. Final Five-Factor Revised STAQ Model from Confirmatory Factor Analysis:
Item Stems and Standardized Loadings in the Cross-Validation Set (

Item     Stem                                                            Loading
“Motivating Science Class” Subscale (Factor 1)
STAQ 17  We do a lot of fun activities in science class.                 0.65
STAQ 7   We cover interesting topics in science class.                   0.67
STAQ 25  I consider our science classroom attractive and comfortable.    0.59
STAQ 27  My science teacher makes good plans for us.                     0.59
STAQ 4   We learn about important things in science class.               0.58
STAQ 5   Our science classroom contains a lot of interesting equipment.  0.56
“Self-Directed Effort” Subscale (Factor 2)
STAQ 57  I always try hard, no matter how difficult the work.            0.73
STAQ 28  I try hard to do well in science.                               0.69
STAQ 13  I always try to do my best in school.                           0.62
STAQ 29  When I fail that makes me try that much harder.                 0.57
“Family Models” Subscale (Factor 3)
STAQ 43  My mother likes science.                                        0.67
STAQ 47  My father likes science.                                        0.66
STAQ 55  My brothers and sisters like science.                           0.59
STAQ 41  My family watches science programs on TV.                       0.54
“Science Is Fun for Me” Subscale (Factor 4)
STAQ 54  I really like science.                                          0.78
STAQ 31  I have good feelings toward science.                            0.73
STAQ 8   I enjoy science courses.                                        0.73
STAQ 1   I would enjoy being a scientist.                                0.58
“Peer Models” Subscale (Factor 7)
STAQ 26  My best friend likes science.                                   0.74
STAQ 50  My best friend in this class likes science.                     0.67
STAQ 2   My friends like science.                                        0.60
STAQ 51  Most of my friends do well in science.                          0.47

Items listed in parallel to Table 2 (rank order).
Univariate ANOVAs showed signiﬁcant grade level variation on each of the ﬁve STAQ
subscores, with η² ranging from .02 (“Family Models”) to .08 (“Peer Models”). Tukey
pairwise comparisons showed that each grade was different from each other grade on the
subscales “Motivating Science Class,” “Science Is Fun for Me,” and “Peer Models.” For
the subscales “Family Models” and “Self-Directed Effort,” eighth graders split off from the
other two grades. From Table 4, it can be seen that scores on each subscale were reliably
lower as grade level increased. The average effect size across all ﬁve subscales was 0.048.
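The scale scores analyzed in the MANOVA were created by averaging across the relevant items. A minimal sketch of that scoring step follows, using the item numbers listed in Table 3. The function name and the example student are hypothetical. Averaging over only the answered items when a response is missing is an assumption, since the text does not say how incomplete subscales were scored, and no reverse scoring is applied because all 22 retained stems are positively worded.

```python
# Item numbers for each revised subscale, taken from Table 3.
SUBSCALES = {
    "Motivating Science Class": [17, 7, 25, 27, 4, 5],
    "Self-Directed Effort": [57, 28, 13, 29],
    "Family Models": [43, 47, 55, 41],
    "Science Is Fun for Me": [54, 31, 8, 1],
    "Peer Models": [26, 50, 2, 51],
}

def scale_scores(responses):
    """responses maps STAQ item number -> rating (1-5) or None if unanswered.
    Returns the mean of the answered items for each subscale
    (None when no item of the subscale was answered)."""
    scores = {}
    for name, items in SUBSCALES.items():
        answered = [responses[i] for i in items if responses.get(i) is not None]
        scores[name] = sum(answered) / len(answered) if answered else None
    return scores

# Hypothetical student: rates every retained item 4, except item 54 (5)
# and item 43 (left blank).
student = {item: 4 for items in SUBSCALES.values() for item in items}
student[54] = 5
student[43] = None
print(scale_scores(student))
```

Keeping scores on the original 1 to 5 metric, as averaging does, makes subscales with different numbers of items directly comparable, which is how the means in Table 4 are reported.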
Structural Model With the Remaining STAQ Dimensions
The factor intercorrelations from the exploratory factor analysis ranged from .02 to
.51. It occurred to us that the intercorrelations might represent more than simple bivariate
relations—they might represent underlying causal arrangements among the STAQ dimen-
sions. We set about linking the ﬁve remaining dimensions in a hypothesized causal arrange-
ment, depicted in Figure 2. For this work, the cross-validation sample (n=906) was used.
Latent variables are shown in elliptical shapes, and observed variables (the item responses)
Figure 1. STAQ subscales created through factor analyses. Note that STAQ 21 was removed because its content
was out of alignment with companion items.
are shown in rectangles. The straight arrows represent hypothesized causal inﬂuences. The
coefﬁcients between a latent variable and its indicators are standardized factor loadings.
Values sitting between the latent variables are standardized path coefﬁcients, akin to beta
weights in a regression model. These depict direct inﬂuences of one latent variable on
another. For example, assuming the causal effect is true, the path coefﬁcient of 0.66 from
“Motivating Science Class” to “Self-Directed Effort” suggests that increasing the latent
variable “Motivating Science Class” by 1 standard deviation would result in about two
thirds of a standard deviation improvement in “Self-Directed Effort.”
Two latent variables—“Family Models” and “Motivating Science Class”—were exoge-
nous, that is, they are beginning points with no speciﬁed causes within the model. We
could not justify a causal relationship between those two, so they were agnostically linked
with the curved, double-headed arrow that simply allows them to be correlated. Each of
the two exogenous variables was hypothesized to inﬂuence other variables in the model, as
shown in Figure 2. Importantly, the three latent variables on the left portion of the model
(“Family Models,” “Motivating Science Class,” and “Peer Models”) may be regarded as
antecedent variables, which inﬂuence the two right side variables (“Science Is Fun for Me”
and “Self-Directed Effort”). In other words, the attitude-like “Science Is Fun for Me” and
the attribution variable of “Self-Directed Effort” were hypothesized to be affected by the
important relationships in a student’s life, plus the teacher’s arrangement and structure of
the science class. The items comprising “Science Is Fun for Me” refer to student perceptions
of positive affect (e.g., I really like science; I have good feelings toward science). Some
Table 4. Descriptive Summary by Gender and Grade, Cross-Validation Data

          Boys                  Girls                 ANOVA
Grade     n     Mean   SD       n     Mean   SD       Effect        p       η²
“Motivating Science Class”
6         97    4.03   0.76     95    3.93   0.72     Gender        .16     .00
7         286   3.75   0.62     300   3.83   0.59     Grade         <.01    .05
8         63    3.37   0.74     59    3.60   0.69     Interaction   .08     .01
Total     446   3.76   0.70     454   3.82   0.64
“Self-Directed Effort”
6         97    4.16   0.81     95    4.42   0.62     Gender        <.01    .03
7         286   4.10   0.76     300   4.29   0.64     Grade         <.01    .04
8         63    3.60   0.87     59    4.06   0.72     Interaction   .18     .00
Total     446   4.04   0.81     454   4.28   0.66
“Family Models”
6         97    2.99   0.93     95    2.77   0.98     Gender        .09     .00
7         286   2.78   0.85     300   2.66   0.80     Grade         <.01    .02
8         63    2.47   0.94     59    2.45   0.74     Interaction   .62     .00
Total     446   2.78   0.89     454   2.66   0.84
“Science Is Fun for Me”
6         97    3.66   0.93     95    3.58   0.90     Gender        .45     .00
7         286   3.30   0.89     300   3.28   0.86     Grade         <.01    .06
8         63    2.68   0.84     59    2.95   0.88     Interaction   .19     .00
Total     446   3.30   0.93     454   3.30   0.89
“Peer Models”
6         97    3.14   0.85     95    3.41   0.83     Gender        .20     .00
7         286   2.91   0.78     286   2.90   0.72     Grade         <.01    .08
8         63    2.53   0.81     59    2.51   0.68     Interaction   .08     .01
Total     446   2.90   0.82     454   2.96   0.78

Note. Subgroup sizes do not sum to data set total of 906 because of several cases of missing data on gender or grade.
Figure 2. Initial structural model of STAQ components.
items within “Motivating Science Class” also refer to positive affect, but with an important
addition: They refer speciﬁcally to teacher-determined lessons (e.g., We do a lot of fun
activities in science class; We cover interesting topics in science class). The logic of the
structural model is that the teacher’s instructional arrangements should strongly inﬂuence
student affect. The opposite causal route—student affect inﬂuencing teacher lessons—
should be less pronounced. Teachers are not oblivious to student affect, but the classroom’s
affective and motivational climate is largely a function of the teacher’s lesson planning and
execution, and management style (Cheng, 1994; Meece, Andermann, & Andermann, 2006;
Turner, Meyer, & Schweinle, 2003).
This structural model was evaluated with the cross-validation sample (n = 906). The
test of exact fit for the structural model gave χ²(221) = 927.9, p < .001. Indices of close
fit gave more encouragement, with CFI = 0.94 and RMSEA = 0.043 (0.90 CI = 0.040–
0.046). Three hypothesized causal paths, removed from Figure 2, were nonsignificant at
p = .05, and were deleted from later analyses. A rerun of the model with the three deleted
paths showed that the trimming did not worsen the overall model (χ² = 930.6,
p < .001; CFI = 0.94; RMSEA = 0.042 [0.90 CI = 0.040–0.045]). Removal of the three
paths changed none of the factor loadings or the remaining path coefficients.
The factor loadings were all strong, with the lowest at 0.47. This implies that the
latent variables had acceptable indicators, and it conﬁrmed the ﬁve strongest factors from
the exploratory factor analysis with the validation data set. Interestingly, in this trimmed
model, three variables (especially “Motivating Science Class”) had important inﬂuence on
“Science Is Fun for Me.” But “Science Is Fun for Me” was disconnected from self-reported
behavior, represented by “Self-Directed Effort.” And although “Family Models” and “Peer Models”
affected “Science Is Fun for Me,” they had no influence on “Self-Directed Effort.”
Further Evaluation of Gender Differences
Substantial earlier work (e.g., American Association of University Women Education
Foundation, 1992; National Coalition for Women and Girls in Education, 2002; U.S. De-
partment of Education, National Center for Education Statistics, 2004) has reported gender
differences in attitudes toward science and many other school variables. We created a multi-
group analysis model, using the cross-validation sample, to test gender differences in the
trimmed structural model. The multigroup approach is typically a series of increasing con-
straints on the model. In the ﬁrst stage, boys’ and girls’ unconstrained models are analyzed
simultaneously, the only requirement being that the factors have the same indicators (but not
the same loadings), and the causal arrows linking the latent variables are the same (but not
the same path coefﬁcients). In the next stage, an additional requirement is added that factor
loadings are equivalent for boys and girls. Each successive constraining condition is nested
in the previous one, so changes in successive models may be tested with a simple χ² difference test.
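The nested-model comparison can be sketched as follows: the difference in χ² between a constrained model and the less constrained model it is nested in is referred to a χ² distribution whose degrees of freedom equal the number of added constraints. The helper below uses only the standard library and evaluates the upper-tail probability with a series for the incomplete gamma function. The function name is hypothetical, and the 17 degrees of freedom in the example are an assumption (22 loadings less one fixed marker loading per factor), since the degrees of freedom for this comparison are not reported in the text.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variate with df
    degrees of freedom, using the series for the regularized lower
    incomplete gamma function. Adequate for moderate x; not for extreme tails."""
    if x <= 0:
        return 1.0
    a, half_x = df / 2.0, x / 2.0
    # First series term: exp(-x/2) * (x/2)^a / Gamma(a + 1)
    term = math.exp(-half_x + a * math.log(half_x) - math.lgamma(a + 1.0))
    total, n = term, 0
    while term > 1e-16 * total:
        n += 1
        term *= half_x / (a + n)
        total += term
    return max(0.0, 1.0 - total)

# Change in exact fit when factor loadings are constrained equal across gender.
delta_chi2 = 20.5   # value reported in the text
delta_df = 17       # assumed, not reported in the text
p_value = chi2_sf(delta_chi2, delta_df)
print(round(p_value, 2))
```

With the assumed degrees of freedom, the result is close to the reported p = .25. A nonsignificant difference of this kind is what supports treating the added constraint as tenable.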
The first of the multigroup series showed the unconstrained comparison to fit closely,
although the exact fit test was significant (χ² = 805.8, p < .001; CFI = 0.93;
RMSEA = 0.030 [0.90 CI = 0.029–0.035]). When the next model forced the factor loadings
to be equal, there was a nonsignificant change in exact fit (χ² = 20.5, p = .25).
Thus, the measurement of the ﬁve revised STAQ latent variables was invariant across
gender. However, each additional constrained model was signiﬁcantly different from the
immediately preceding model. Of most interest were the differences in “structural weights,”
i.e., the path coefﬁcients. The variations are shown in Figure 3 (girls) and Figure 4 (boys);
these are the models in which the path coefﬁcients were free to vary across gender. One
notable difference in magnitude appears between “Family Models” and “Science Is Fun for
Me” (standardized coefﬁcient =.34 and .20 for girls and boys, respectively). This implies
Figure 3. Girls’ structural model with factor loadings constrained.
Figure 4. Boys’ structural model with factor loadings constrained.
that, compared to boys, family members of girls play a stronger role in affecting girls’
beliefs about whether science is fun. However, this gender difference should not obscure
the far larger effect of “Motivating Science Class” on “Science Is Fun for Me.”
A second obvious difference shows up between “Motivating Science Class” and “Self-
Directed Effort” (.56 and .69 for girls and boys, respectively). This suggests that the
teachers’ efforts to arrange the science class environment have a larger impact on boys’
effort than on girls’ effort. But as before, we emphasize that the gender difference is
fairly small in view of the larger principle: Among the various motivational aspects of this
structural model, it is the environment of the science class, arranged by the science teacher,
that delivers a large effect on self-reported student effort. Signiﬁcantly, the science teacher
wields considerable control over class environment, choice of activities, and instructional
An additional impact of “Motivating Science Class” is not immediately obvious from
the girls’ and boys’ structural models of Figures 3 and 4. The total standardized effect of
“Motivating Science Class” on “Science Is Fun for Me” was nearly the same for girls (0.66)
and for boys (0.68). The total effect is the sum of the direct link between these two latent
variables, plus the indirect route mediated through “Peer Models.” In the girls’ model, the
total effect is decomposed into 0.07 indirect and 0.59 direct effects; in the boys’ model it
is decomposed into 0.23 indirect and 0.45 direct. In other words, “Peer Models” was a far
more important mediator for boys than for girls.
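
The decomposition just described is simple arithmetic: total effect = direct effect + indirect effect. The brief sketch below reproduces the reported totals and adds the share of each total that is carried by the indirect route. The function names and the dictionary layout are illustrative assumptions. Only the four effect values come from the text.

```python
# Standardized effects of "Motivating Science Class" on "Science Is Fun for Me"
# as reported in the text; the indirect route runs through "Peer Models".
effects = {
    "girls": {"direct": 0.59, "indirect": 0.07},
    "boys": {"direct": 0.45, "indirect": 0.23},
}

def total_effect(direct, indirect):
    """Total effect is the sum of the direct and indirect effects."""
    return direct + indirect

def mediated_share(direct, indirect):
    """Proportion of the total effect carried by the indirect route."""
    return indirect / (direct + indirect)

for group, e in effects.items():
    total = total_effect(**e)
    share = mediated_share(**e)
    print(f"{group}: total = {total:.2f}, mediated share = {share:.0%}")
```

By this calculation roughly 11% of the girls' total effect and roughly 34% of the boys' total effect is mediated, which is one way to express the larger mediating role of “Peer Models” for boys.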
Further Evaluation of Grade-Level Differences
In the earlier MANOVA, conspicuous grade-level differences were seen for each of
the STAQ subscales. For another perspective on this, we used the cross-validation data to
perform a multigroup analysis across the three grade levels. The arrangement was similar to
the gender multigroup model, with a series of increasing constraints following a completely
unconstrained model. However, even the baseline unconstrained model showed substandard
fit (χ² = 1188.0, p < .001; CFI = 0.89; RMSEA = 0.033 [90% CI = 0.030–0.035]).
Given a poor start and the fact that approximately 65% of the sample was composed
of seventh graders, there is not much merit in assessing the sequential constraints (although
each was signiﬁcantly different from each preceding model). Evidently, the systematic
and widespread grade-level differences seen in the MANOVA reverberated through the
multigroup structural model as well.
Reliability Evidence for the Revised STAQ Subscales
In general, reliability asks about score consistency. Because consistency has various
deﬁnitions (e.g., consistency over time; consistency among items), it is helpful to calculate
more than one reliability estimate to see how consistent the coefﬁcients themselves are. The
classic reliability coefﬁcient is calculated as Cronbach’s alpha, which evaluates consistency
among items. However, Hancock and Mueller (2001) have argued that alpha is really an
estimate for a composite of observed variables (e.g., item responses). They pointed out that
in structural equation modeling, one might instead be interested in “construct reliability,”
which may be calculated as coefﬁcient H, a direct reliability estimate of the latent variable.
Following Hancock and Mueller’s formulas for coefﬁcient H, Table 5 shows coefﬁcient H
values for each of the ﬁve Simpson–Troost latent variables, as well as classic Cronbach
alphas and stability coefﬁcients.
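
For readers who want to compare the two internal-consistency estimates, the sketch below computes coefficient H from standardized factor loadings and Cronbach's alpha from item scores. It is a minimal illustration, not the authors' computation. The loadings and item responses are hypothetical, and the function names are assumptions.

```python
from statistics import pvariance

def coefficient_h(loadings):
    """Coefficient H from standardized loadings: H = s / (1 + s),
    where s is the sum of l**2 / (1 - l**2) over the indicators."""
    s = sum(l * l / (1.0 - l * l) for l in loadings)
    return s / (1.0 + s)

def cronbach_alpha(items):
    """Cronbach's alpha; `items` holds one list of scores per item,
    with respondents in the same order in every list."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1.0 - item_variance_sum / pvariance(totals))

# HYPOTHETICAL data: four standardized loadings and four items
# answered by six students on a 1-5 scale.
hypothetical_loadings = [0.60, 0.65, 0.70, 0.70]
hypothetical_items = [
    [4, 3, 5, 2, 4, 1],
    [5, 3, 4, 2, 4, 2],
    [4, 2, 5, 3, 3, 1],
    [3, 3, 4, 2, 5, 2],
]

print(f"H = {coefficient_h(hypothetical_loadings):.2f}")
print(f"alpha = {cronbach_alpha(hypothetical_items):.2f}")
```

With a single indicator, H reduces to the squared loading, and adding indicators can only raise it. Alpha, by contrast, is computed from observed item and total-score variances.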
The stability data came from an independent data set of an additional 186 youngsters,
which included 68 ninth and tenth graders (approximately 14–16 years old). We wanted
to see whether the middle school reliability evidence generalized to high schoolers. These
students completed the STAQ on two occasions 1 week apart.
Several summary results from the reliability data are clear. First, the reliability estimates
are adequate, running mostly in the .70s. The “Science Is Fun for Me” dimension was
dependably more consistent, and this is also supported by higher 1-week stability coefficients.
Second, there was very little difference between coefficients H and alpha. And third,
differences between boy and girl reliabilities were negligible. However, when stability data were
examined by school level, high school student data generally showed more consistency
over time, compared to middle school data.

TABLE 5
Reliability Estimates for Revised Simpson–Troost Subscales

                              Construct (H)    Classic (alpha)   Stability
Variable                      Boys    Girls    Boys    Girls     Boys    Girls   Middle School   High School
“Motivating Science Class”    .77     .79      .78     .79       .82     .77     .74             .77
“Self-Directed Effort”        .75     .71      .75     .71       .78     .82     .81             .79
“Family Models”               .72     .71      .74     .70       .77     .84     .77             .87
“Science Is Fun for Me”       .82     .84      .79     .81       .85     .87     .84             .90
“Peer Models”                 .71     .77      .70     .73       .77     .79     .72             .90

For 1-week stability estimates, boys
=100, middle school =118, high

DISCUSSION
We accomplished three aims in this paper. The ﬁrst was to conduct a psychometric review
and reappraisal of the STAQ using an independently collected data set from a sample of
middle school students (about 11–14 years old). In our analyses, the original 14 subscales
did not fit a cohesive structure. Instead, using exploratory and confirmatory factor analyses,
requiring a minimum factor loading of 0.35 per item and a minimum subscale Cronbach’s alpha of .70, we
were able to derive a five-factor revised STAQ. This model reduced the number of items
from 57 to 22. Reducing the items on any scale while retaining or improving data reliability
and validity has two practical virtues. Shortened instruments (a) require less class time to
administer and (b) reduce the problem of missing data.
The second aim was to determine whether the ﬁve-factor revised STAQ would support a
putative structural equation model of inﬂuences on science affect and self-directed effort.
The model fit the data and demonstrated that “Motivating Science Class” has strong associations
with science affect (“Science Is Fun for Me”) and “Self-Directed Effort.” Family
models and peer models were associated with science affect, but not effort. The model ﬁt
equally well for boys and girls with moderate differences between two of the speciﬁed
paths. However, there were sizeable grade-level differences among the ﬁve revised STAQ
Although many variables related to science education are under only partial control
of educators, the models generated in this study suggest that creating a motivating, fun
science class has greater impact on student affect (“Science Is Fun for Me”) and effort
(“Self-Directed Effort”) than family or peer variables. This is a compelling afﬁrmation
(and responsibility) for teachers to know that their effectiveness in the classroom exerts
considerable inﬂuence. The subscale “Motivating Science Class” consists of items related
to the physical environment of the classroom, the teacher, and the curriculum. This subscale
has strong associations with both student perceptions of science and student effort, afﬁrming
the importance of the classroom teacher. “Family Models” have modest associations with
student perception that science is fun, while “Peer Models” have about the same effect.
Neither “Family Models” nor “Peer Models” exerted signiﬁcant effect on “Self-Directed
Effort.” Interestingly, the causal relationships hypothesized in the model afﬁrm Simpson
and Oliver’s ﬁnding (1990) that classroom variables had the strongest impact on student
affect. Simpson and Oliver (1990) discovered that gender differences were not as signiﬁcant
as hypothesized, which was also true of our data.
The third aim was to assess the stability of the ﬁve-factor STAQ. This was accomplished
in a separate sample of 186 students and showed repeatability estimates ranging from 0.70
to 0.90. While acceptable, there is still room for improvement (see Table 5). Thus, the
groundwork is laid for additional research into the ability of this model to predict student
achievement over time and perhaps student participation in science over time.
Although we noted that structural equation modeling accounts for measurement error in
calculating parameter estimates, that does not itself justify the continued use of psychometrically
substandard measures. Classic parametric procedures (e.g., ANOVA, regression)
are bound by the impossible assumption that measurements have zero measurement error.
To the extent measurement error exists, those sorts of analyses will deliver biased results.
In simple bivariate analyses, such as a correlation, there is an exactly known downward
bias. But in a multivariate approach, measurement error ripples through the entire set of
measures, and results are unpredictably biased upwardly or downwardly (Owen & Froman,
2005). Such biases can readily distort researchers’ conclusions.
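
The known downward bias in the bivariate case can be illustrated with the classical attenuation formula, under which the observed correlation equals the true correlation multiplied by the square root of the product of the two reliabilities. The sketch below assumes classical test theory. The numeric values and function names are hypothetical and are not taken from the study data.

```python
import math

def attenuated_r(true_r, rel_x, rel_y):
    """Expected observed correlation given the true correlation and
    the reliabilities of the two measures (classical test theory)."""
    return true_r * math.sqrt(rel_x * rel_y)

def disattenuated_r(observed_r, rel_x, rel_y):
    """Correction for attenuation: the inverse of the formula above."""
    return observed_r / math.sqrt(rel_x * rel_y)

# HYPOTHETICAL example: a true correlation of .50 measured with
# two scales whose reliabilities are .70 and .80.
observed = attenuated_r(0.50, 0.70, 0.80)
print(f"expected observed r = {observed:.2f}")
print(f"recovered true r = {disattenuated_r(observed, 0.70, 0.80):.2f}")
```

In this hypothetical case a true correlation of .50 would be observed as about .37. No comparably simple correction exists once several fallible measures enter the same multivariate model.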
Theoretical Basis for Hypothesized Structural Model
By its nature, science education is a dynamic interaction of variables related to learners,
educators, schools, science, and society. The literature generally assumes that childhood
experiences greatly inﬂuence both academic interest and achievement (e.g., Simpson &
Oliver, 1990). However, some relevant inﬂuences, such as personal motivation, and home
and peer variables, are only under partial control of educators (Kremer & Walberg, 1981).
Children and adolescents imitate those they respect and admire and with whom they identify.
Typically, those are models with whom they have the most contact, such as parents, siblings,
peers, and perhaps older friends (Owen, Froman, & Moscow, 1981). The norms displayed by
such groups help to steer students’ behavior and attitudes (Fishbein & Ajzen, 1975). There
is thus some theoretical and empirical support for family and peer inﬂuences on student
perceptions about science (“Science Is Fun for Me”) as shown in the hypothesized structural
Although individual and home inﬂuences strongly affect attitudes toward science, their
inﬂuence is mediated by the science classroom (Simpson & Oliver, 1990). Talton and
Simpson (1986) found that classroom environmental variables were the “strongest predictors
of attitudes toward science in grade levels 6–10” (p. 366), and teachers had the largest
single impact on classroom variables (Simpson & Oliver, 1990). When a teacher is animated,
enthusiastic, and excited about the subject at hand, students experience emotional
arousal that can change behavior (Bandura, 1986). In contrast, a facile, disengaged, and
critical teacher can crucially blunt student attitudes and interest. Thus, teachers help to
determine student behavior and attitudes toward school science. This important role of the
classroom teacher is reﬂected in the model as “Motivating Science Class” (arranged by the
teacher) and is positioned as an antecedent to the attitudinal “Science Is Fun for Me” and
the behavioral “Self-Directed Effort.”
Whether attitudes inﬂuence behavior, or the other way around, has been notably difﬁcult
to establish, in some part because of imprecise deﬁnition of terms (Simpson et al., 1994).
Behavior may be inﬂuenced by unmeasured attitudes that exert a stronger effect than
measured ones, and behavior may seem inconsistent with a given attitude if it is due
to a stronger, hidden motivation, or modiﬁed by expected and unexpected consequences
(Osborne et al., 2003). It is also important to make a distinction between attitudes toward
science in general and attitudes toward school science; the latter may be a better predictor
of student behavior (Osborne et al., 2003). Thus the items contained in the “Science Is Fun
for Me” scale may be revised in future studies to more clearly deﬁne the construct and may
result in a more clearly deﬁned interaction with student “Self-Directed Effort.”
Programs directed at increasing achievement motivation have not been generally successful
(Elliott & Bempechat, 2002; Elliott & Huffton, 2003), and yet motivating students
toward classroom goals is a major responsibility of teachers (Owen et al., 1981). Simpson
and Oliver (1990) found that attitudes toward science, science self-concept, and achievement
motivation, all self-related factors, were strong predictors of student achievement
in science. The revised STAQ subscales “Science Is Fun for Me” and “Self-Directed
Effort” consist of items from these original subscales, as shown in Figure 1. The theoretical
relationship among these factors may point the way toward further revision of the
Limitations of the Study
The original STAQ was intended to measure student attitudes toward science. In our data,
we discovered that a subset of STAQ items seems to assess additional constructs related to,
but different from, attitudes. For example, such constructs as “Peer Models” and “Family
Models” plainly focus on perceptions about peer and family behaviors rather than personal
attitudes. It is the task of additional research to study whether these restructured subscales
are better than, or redundant with, existing measures. In short, our dimensionality analyses,
and the subsequent elimination of certain items, have changed the nature of the STAQ. This
is an obvious limitation, in that our results cannot readily be compared to earlier STAQ
research. But it is also a psychometric enhancement, in that the original STAQ produced
It is risky to assert causal relationships with cross-sectional data, especially when claimed
causes and consequences are measured at the same time. We acknowledge that our causal
links are simply hypotheses. As such, SEM allows them to be disconfirmed but not confirmed.
Unfortunately, there is little hope of creating a true experimental design with such
data, because student perceptions cannot be randomly assigned. We recommend future
work with a longitudinal approach, where constructs are assessed on at least two occasions.
This sort of research design offers a stronger approach to evaluating causal claims.
In our data, we found increasingly lower self-reported scores as grade level increased
from sixth through eighth. Our ﬁnding roughly parallels that of Reid and Skryabina (2003),
who used a large national data set from Scotland. They suggested that schools and teachers
think about how to make science—and the teaching of science—more interesting and
With cross-sectional data, it is incorrect to describe grade-level differences as showing
“declining” attitudes. However, George (2000) used a national longitudinal database that
authentically documented declines over time in student attitudes toward science, conﬁrming
the work of Simpson and Oliver (1990) and Kaplan (2000). George also found that boys’
attitudes declined more sharply over time than those of girls. But because those attitude
measures were linked to speciﬁc science courses, there was no way to tell whether attitudes
declined generally, or whether taking speciﬁc courses (e.g., physics) was the inﬂuence.
We also point out that the sophistication and power of SEM are no substitute for logical
thinking and convincing arguments about causes and consequences. It is well known,
even with statistical evidence favoring a given model, that there are alternate (and usually
untested) models that ﬁt the data just as well. Of course, researchers cannot evaluate
the entire universe of alternate models, and must eventually summarize a tentative but
plausible one. Additional research—with new data—is necessary to replicate or reﬁne a
model of student attitudes toward science. And because all of our data refer to various
student perceptions, future research should include some objective measure of student
achievement, which is widely viewed as the most important outcome of schooling.
Finally, we remind readers that our data came from a restricted geographical and cultural
area, in south Texas and with schools whose populations are about 80% Mexican American.
Whether our ﬁndings generalize to other ethnicities or areas (including countries) remains
a topic for further research.
Our restructuring of the STAQ into ﬁve dimensions did not produce a conclusive instru-
ment. The multigroup structural model for gender was promising in that the factor loadings
could be forced to equality without harming the overall model ﬁt. But the multigroup model
for grade level, even with no constraints imposed, did not show satisfactory ﬁt. There are
thus additional questions about how much tinkering with items and subscales is needed to
show measurement invariance across grades 6, 7, and 8. We suggest caution to researchers
who use the revised STAQ with samples that span two or more grade levels.
We urge researchers who intend to use selected revised Simpson–Troost subscales to consider
improving their internal consistency reliabilities. One way to do this is by lengthening
the scales (Nunnally & Bernstein, 1994). This leads to an important research decision, because
scale revision demands not only writing additional items but also pilot testing prior to
the main study. Researchers need to consider the trade-offs between instrument length, feasibility
of administering the instrument, and data completeness. Investigators must choose
instruments that adequately assess the study variables and defend their choices. If more
researchers chose such rigor, collectively, they would contribute to the development of more
dependable instruments for all (Blalock et al., 2008). In an environment that increasingly
stresses best-evidence practices and therefore effective evaluation strategies, it is imperative
that the research community take time to collect, analyze, and share extant scale data that
will ultimately lead to better assessment tools.
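
The trade-off between scale length and reliability mentioned above can be estimated in advance with the Spearman-Brown prophecy formula, which assumes that added items are comparable in quality to the existing ones. The sketch below is a planning aid under that assumption. The starting reliability, target, and item count are hypothetical, and the function names are illustrative.

```python
import math

def spearman_brown(reliability, length_factor):
    """Predicted reliability when a scale is lengthened by `length_factor`
    (e.g., 2.0 means twice as many comparable items)."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)

def length_factor_needed(reliability, target):
    """Lengthening factor required to move from `reliability` to `target`."""
    return target * (1.0 - reliability) / (reliability * (1.0 - target))

# HYPOTHETICAL example: a 4-item subscale with reliability .70,
# and a researcher who wants to reach .80.
current_items = 4
factor = length_factor_needed(0.70, 0.80)
items_needed = math.ceil(current_items * factor)
print(f"lengthening factor = {factor:.2f}, items needed = {items_needed}")
print(f"reliability if doubled = {spearman_brown(0.70, 2.0):.2f}")
```

In this hypothetical case a 4-item subscale would need about 7 comparable items to reach .80, which shows why the added administration time has to be weighed against the gain in reliability.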
Structural equation analyses provided evidence for strong associations among the classroom
environment, student effort, and science affect. Classrooms are generally under the
management and control of the teacher and school leadership. Our ﬁndings raise the
potential for teacher professional development to improve the classroom and activities,
enhance student “Self-Directed Effort,” and presumably improve science achievement. To
test this will require longitudinal data collection in controlled studies. The revised STAQ
provides a feasible instrument to evaluate the dynamic relationships of these variables over
This work was supported by Science Education Partnership Award R25-RR-18549 (National Center
for Research Resources [NCRR] and the National Institute on Aging [NIA]), a Minority K-12
Initiative for Teachers and Students Grant R25-HL-075777 (National Heart, Lung, and Blood Institute
[NHLBI]) and MO1-RR-01346 for the Frederic C. Bartter General Clinical Research Center. The
NCRR, NIA, and NHLBI are all part of the National Institutes of Health (NIH).
We thank the administration, teachers, and students from Edgewood ISD, Northside ISD, NorthEast
ISD, San Antonio ISD, and South San Antonio ISD in San Antonio, Texas, and Hondo ISD in Hondo,
Texas. Our special thanks go to Linda Bononcini, assistant superintendent, Edgewood ISD; George
Colon, principal, and Teresa Gatell, language arts teacher at Truman MS. From Northside ISD, our thanks
go to Dr. Phil Linerode, evaluation specialist; Alice Fiedler, science curriculum specialist; John Folks,
superintendent; Priscilla Shaver, gifted/talented coordinator; Javier Martinez, principal and Lorenda
Segura, science teacher at Zachry MS; Lynn Pierson, principal, and Jose Luna, special education
teacher at Luna MS. From NorthEast ISD, we thank Richard Middleton, superintendent; Alicia
Thomas, associate superintendent; Mark Shefﬂer, associate superintendent; Don Dalton, executive
director, curriculum & instruction; Pattie Castellano, science curriculum coordinator; Francene Tharp,
health services; Randy Hoyer, principal and Tamisine Neal and Melissa Moody, science teachers
from Bush MS. From San Antonio ISD, our thanks go to Ruben Olivares, superintendent; Bill Vinal,
science director; Anita Chavera, principal, Josephine Rose, language arts teacher, A. Ahrenholtz,
M. Dovalina, J. Garza, E. Jordan, Carla Maldonado, C. Smith, John Stauffer, Adah Stock, and L.
Thomas, science teachers from Irving MS; Armando Gutierrez, principal, Elizabeth Aguilar-Cruz,
Deborah Friesenhan, and Francisco Lara, science teachers from Lowell MS. From South San Antonio
ISD, our thanks go to Evelyn Trinidad, director for mathematics and science; Thomas Fonseca,
principal and Elsa Gonzalez and Mary Ramos, science teachers at Dwight MS. From Hondo ISD, our
thanks go to Clyde Parsons, superintendent; Charles Carlson, assistant superintendent; Larry Carroll,
principal and Lee Ann Young and Cora Rothe, science teachers at Hondo HS.
We thank Olivia Lemelle for her graphic design assistance.
REFERENCES
American Association of University Women Educational Foundation. (1992). How schools short-change girls:
Executive summary. Washington, DC: Author.
Atwater, M. M., Wiggins, J., & Gardner, C. M. (1995). A study of urban middle school students with high and
low attitudes toward science. Journal of Research in Science Teaching, 32, 665 – 677.
Bandura, A. (1986). Observational learning. In A. Bandura (Ed.), Social foundations of thought and action: A
social cognitive theory (pp. 47– 105). Englewood Cliffs, NJ: Prentice-Hall.
Blalock, C. L., Lichtenstein, M. J., Owen, S. V., Pruski, L. A., Marshall, C. E., & Toepperwein, M. A. (2008). In
pursuit of validity: A comprehensive review of science attitude instruments 1935 – 2005. International Journal
of Science Education, 30, 961 – 977.
Cannon, R. K. (1983). Relationships among attitude, motivation, and achievement of ability grouped, seventh-
grade, life science students. Dissertation Abstracts International, 44, 1408. (UMI No. 8320073)
Cannon, R. K., & Simpson, R. D. (1985). Relationships among attitude, motivation, and achievement of ability
grouped, seventh-grade life science students. Science Education, 69, 121 – 138.
Cheng, Y. C. (1994). Classroom environment and student affective performance: An effective proﬁle. Journal of
Experimental Education, 62, 221 – 238.
Elliott, J. G., & Bempechat, J. (2002). The culture and contexts of achievement motivation. New Directions for
Child and Adolescent Development, 96, 7 – 26.
Elliott, J. G., & Huffton, N. (2003). Achievement motivation in real contexts. British Journal of Educational
Psychology Monograph Series II, Number 2—Development and Motivation, 1, 155 – 172.
Fishbein, M., & Ajzen, I. (1975). Belief, attitude, intention and behavior: An introduction to theory and research.
Reading, MA: Addison-Wesley.
Fraser, B. J. (1977). Selection and validation of attitude scales for curriculum evaluation. Science Education, 61,
317 – 330.
Gardner, C. (1992). Inﬂuences on attitudes toward science of nontraditional students. Dissertation Abstracts
International, 53, 3161. (UMI No. 9301195)
Gardner, P. L. (1975). Attitude measurement: A critique of some recent research. Educational Research, 17,
101 – 109.
George, R. (2000). Measuring change in students’ attitudes toward science over time: An application of latent
variable growth modeling. Journal of Science Education and Technology, 9, 213– 225.
Greenfield, T. A. (1997). Gender and grade level differences in science interest and participation. Science Education,
81, 259 – 275.
Hancock, G. R., & Mueller, R. O. (2001). Rethinking construct reliability within latent variable systems. In
R. Cudeck, S. du Toit, & D. Sörbom (Eds.), Structural equation modeling: Present and future—A Festschrift
in honor of Karl Jöreskog (pp. 195 – 216). Lincolnwood, IL: Scientific Software International, Inc.
Hill, G. D., Atwater, M. M., & Wiggins, J. (1995). Attitudes toward science of urban 7th grade life-science students
over time, and the relationship to future plans, family, teacher, curriculum and school. Urban Education, 30,
71 – 92.
Hu, L. T., & Bentler, P. (1995). Evaluating model ﬁt. In R. H. Hoyle (Ed.), Structural equation modeling. Concepts,
issues, and applications (pp. 76 – 99). Thousand Oaks, CA: Sage.
InterAcademy Panel on International Issues. (2008). Science education. Retrieved February 12, 2008 from
Kaplan, D. (2000). Structural equation modeling: Foundations and extensions. Thousand Oaks, CA: Sage.
Khalili, K. Y. (1987). A crosscultural validation of a test of science related attitudes. Journal of Research in
Science Teaching, 24, 127 – 136.
Koballa, T. R. (1988). Attitude and related concepts in science education. Science Education, 72, 115 – 126.
Kremer, B. K., & Walberg, H. J. (1981). A synthesis of social and psychological inﬂuences on science learning.
Science Education, 65, 11 – 23.
Lichtenstein, M. J., Owen, S. V., Blalock, C. L., Liu, Y., Ramirez, K. A., Pruski, L. A. et al. (2008). Psychometric
re-evaluation of the Scientiﬁc Attitude Inventory-Revised (SAI-II). Journal of Research in Science Teaching,
45, 600 – 616.
Lin, B., & Crawley, F. E. (1987). Classroom climate and science-related attitudes of junior high school students
in Taiwan. Journal of Research in Science Teaching, 24, 579 –591.
Maidon, C. H. (2001). A comparison of a ﬁfth-grade elementary school science research-based curriculum
and an activity-centered traditional curriculum: effects on conceptual knowledge, process skills and attitude.
Dissertation Abstracts International, 63, 2497. (UMI No. 3059912)
Marshall, C. E., Blalock, C. L., Liu, Y., Pruski, L. A., Toepperwein, M. A., Owen, S. V., et al. (2007). Psychometric
re-evaluation of the Image of Science and Scientists Scale (ISSS). School Science and Mathematics, 107, 149–
Meece, J. L., Anderman, E. M., & Anderman, L. H. (2006). Classroom goal structure, student motivation, and
academic achievement. Annual Review of Psychology, 57, 487 – 503.
Munby, H. (1980). An evaluation of instruments which measure attitudes to science. In C. P. McFadden (Ed.),
World trends in science education (pp. 226 – 275). Halifax, Nova Scotia, Canada: Atlantic Institute of Education.
National Academy of Sciences. (2006). Rising above the gathering storm: Energizing and employing America for
a brighter economic future. Washington, D.C.: The National Academies Press.
National Coalition for Women and Girls in Education. (2002). Title IX at 30: Report card on gender equity.
Washington DC: Author.
National Science Teachers Association. (1996). Position statement on international science education. Retrieved
on February 12, 2008, from http://www.nsta.org/about/positions/international.aspx
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Oliver, J. S. (1986). A longitudinal study of attitude, motivation and self-concept as predictors of achievement in
and commitment to science among adolescent students. Dissertation Abstracts International, 47, 2983. (UMI
Oliver, J. S., & Simpson, R. D. (1988). Inﬂuences of attitude toward science, achievement motivation, and science
concept on achievement in science: A longitudinal study. Science Education, 72, 143 – 155.
Osborne, J., Simon, S., & Collins, S. (2003). Attitudes towards science: A review of the literature and its
implications. International Journal of Science Education, 25, 1049 – 1079.
Owen, S. V., & Froman, R. D. (2005). Why carve up your continuous variables? Research in Nursing & Health,
28, 496 – 503.
Owen, S. V., Froman, R. D., & Moscow, H. (1981). Educational psychology (2nd ed.). Boston: Little, Brown &
Owen, S. V., Toepperwein, M. A., Pruski, L. A., Blalock, C. L., Liu, Y., Marshall, C. E., et al. (2007). Psychometric
reevaluation of the Women in Science Scale (WiSS). Journal of Research in Science Teaching, 44, 1461–1478.
Polit, D. F., Beck, C. T., & Owen, S. V. (2007). Is the CVI an acceptable indicator of content validity? Appraisal
and recommendations. Research in Nursing & Health, 30, 459 – 467.
Polit, D. F., Beck, C. T., & Owen, S. V. (in press). Item and scale content validity: Clarification and
recommendations. Research in Nursing & Health.
Reid, N., & Skryabina, E. (2003). Gender and physics. International Journal of Science Education, 25, 509 – 536.
Rennie, L. J., & Punch, K. F. (1991). The relationship between affect and achievement in science. Journal of
Research in Science Teaching, 28, 193 – 209.
Schibeci, R. A., & McGraw, B. (1981). Empirical validation of the conceptual structure of a test of science-related
attitudes. Educational and Psychological Measurement, 41, 1195 – 1201.
Simpson, R. D., Koballa, T. R., Oliver, J. S., & Crawley, F. E. (1994). Research on the affective dimension of
science learning. In D. Gabel (Ed.), Handbook of research in science teaching and learning (pp. 211 – 234).
New York: Macmillan.
Simpson, R. D., & Oliver, J. S. (1985). Attitude toward science and achievement motivation proﬁles of male and
female students in grades six through twelve. Science Education, 69, 511–526.
Simpson, R. D., & Oliver, J. S. (1990). A summary of major inﬂuences on attitude toward and achievement in
science among adolescent students. Science Education, 74, 1 – 18.
Simpson, R. D., & Troost, K. M. (1982). Inﬂuences on commitment to and learning of science among adolescent
students. Science Education, 66, 763 – 781.
Spellman, J. E., & Oliver, J. S. (2001). The relationship between attitude toward science with enrollment in a 4 ×
4 block schedule. Proceedings of Annual Meeting of the Association for the Education of Teachers in Science
(pp. 3 – 30), Costa Mesa, CA.
Talton, E. L., & Simpson, R. D. (1985). Relationships between peer and individual attitudes toward science among
adolescent students. Science Education, 69, 19 – 24.
Talton, E. L., & Simpson, R. D. (1986). Relationships of attitudes toward self, family, and school with attitude
toward science among adolescents. Science Education, 70, 365– 374.
Talton, E. L., & Simpson, R. D. (1987). Relationships of attitude toward classroom environment with attitude
toward achievement in science among adolescent tenth grade students. Journal of Research in Science Teaching,
24, 507 – 525.
Turner, J. C., Meyer, D. K., & Schweinle, A. (2003). The importance of emotion in theories of motivation:
Empirical, methodological, and theoretical considerations from a goal theory perspective. International Journal
of Educational Research, 39, 375 – 393.
U.S. Department of Education, National Center for Education Statistics. (2004). Trends in educational equity of
girls & women: 2004 (NCES 2005-016). Washington, DC: U.S. Government Printing Ofﬁce.
Wyer, M. (2003). Intending to stay: Images of scientists, attitudes toward women and gender as inﬂuences
on persistence among science and engineering majors. Journal of Women and Minorities in Science and
Engineering, 9, 1 – 16.