Reexamining the Role of Vision in Second Language Motivation: A Preregistered Conceptual Replication of You, Dörnyei, and Csizér (2016)


Language Learning ISSN 0023-8333
Phil Hiver^a and Ali H. Al-Hoorie^b
^aFlorida State University and ^bRoyal Commission for Jubail and Yanbu
Researchers have linked vivid mental imagery, particularly of the self in future states, to
many desirable motivational outcomes for language learning. We report a preregistered
conceptual replication and extension of You, Dörnyei, and Csizér (2016), who found
a central motivational role for vision. We review essential considerations in structural
equation modeling and discuss how the initial study addressed these, then describe
a conceptual replication with a South Korean sample of secondary school learners of
English (N = 1,297). Our analysis of the scales from the initial study in addition to second
language achievement found support for an alternative model where the Intended Effort
scale showed a better fit as a predictor of motivation than as an outcome variable. Our
findings suggest the need for greater precision and rigor in structural equation modeling
research on second language learning motivation and for more language researchers to
take up replication and other open science initiatives.
We would like to thank the five reviewers for their thoughtful and constructive comments. We also
thank the editorial team of Pavel Trofimovich, Kara Morgan-Short, and Emma Marsden for their
assistance throughout the review process. Special thanks must also go to Chenjing (Julia) You for
her time reading and commenting on a previous version of this manuscript.
This article has been awarded Open Data, Open Materials, and Preregistered
Research Design badges. All data and materials, along with preregistration
for research design and analyses, are publicly accessible through the
Open Science Framework. The study materials are also publicly available via
the IRIS database. More information about the Open Practices badges is
available from the Center for Open Science.
Correspondence concerning this article should be addressed to Phil Hiver, Florida State University,
School of Teacher Education, College of Education, 1114 W. Call St., G128 Stone Building,
Tallahassee, FL 32306, United States.
Language Learning 70:1, March 2020, pp. 48–102
© 2019 Language Learning Research Club, University of Michigan
DOI: 10.1111/lang.12371
Hiver and Al-Hoorie Reexamining Vision in Second Language Motivation
Keywords: motivation; vision; replication; preregistration; structural equation modeling; gender; second language
Replication of research findings has recently been recognized as an important
component of cumulatively refining empirical evidence in the social sciences,
including the field of language learning (Marsden, Morgan-Short, Thompson,
& Abugaber, 2018; Morgan-Short et al., 2018; Porte, 2012). Mackey and Gass
(2005) defined replication as “[c]onducting a research study again, in a way
that is either identical to the original procedure or with small changes . . .
to test the original findings” (p. 364), with the successful replication of a
finding lending further legitimacy to the initial study (Hiver & Al-Hoorie,
2016). A replication may be direct, when researchers hold constant design,
methods, and analysis, or partial, when researchers intentionally introduce one
significant change to a key variable to test replicability under the new condition
or procedure (Marsden, Morgan-Short, Thompson, & Abugaber, 2018). In
contrast, conceptual replications, like the current study, introduce more than
one significant change to an initial study and can provide information about the
potential value of different approaches to investigating a problem or different
conceptualization of a construct.
A subdomain of language learning research where replication research
is needed is language learning motivation. In recent literature in the second
language (L2) motivation field, vision and mental imagery¹ have received considerable
attention, with greater numbers of researchers showing interest in
the empirical assessment (see Al-Hoorie, 2018) and practical application (e.g.,
Dörnyei & Kubanyiova, 2014; Hadfield & Dörnyei, 2013) of vision in facilitating
language learning. For example, You, Dörnyei, and Csizér (2016) conducted
a study that they described as “the first to offer a broad overview of the extent
to which the capacity of vision contributes to the overall motivational setup of
a whole language learning community” (p. 94). Their study with a large-scale
Chinese sample reported findings related to a number of motivation constructs,
including vision, sensory style, positive and negative change in L2 self im-
age, and gender. In evaluating their results, You et al. concluded that their
study offered unambiguous support for the role of vision in language learning
motivation (cf. pp. 113 and 120).
However, as our review of the literature shows, there is also evidence sug-
gesting that the effect of vision on effort and performance is open to question.
Thus, in order to refine our understanding of the extent to which vision can
be said to contribute to motivation for L2 learning, the first purpose of the
present study was to attempt to conceptually replicate the structural equation
model (SEM) used by You et al. (2016). Replications of any kind have tended
to be rare for studies using SEM. Perhaps because SEM typically requires
large samples, most of the SEM literature has consisted of “one-shot studies”
for which the researchers have not attempted replications (Kline, 2016, p. 121).
This situation has often been further complicated in instances where SEM has
been used to engage in post hoc, data-driven model modification, rendering
the results exploratory and potentially less likely to replicate. Because SEM is
prone to these and other misapplications, a second equally important purpose
of the present article was to provide an up-to-date methodological review of
considerations that SEM users must take into account when they analyze data
and report results.
Following best practices in replication research (Brandt et al., 2014), we first
closely examined the purpose, design, and methodology of You et al.’s (2016)
study to consider the rationale for our replication and then preregistered our
design and analysis plan. We identified a clear need for some kind of replication,
given the impact of the study on the field and the broader societal importance of
better understanding motivation in language learning. With over 65 Google Scholar
citations in less than 3 years (i.e., more than 21 per year), the study has exceeded the
average of 17.65 annual citations that initial studies receive before being
replicated (Marsden, Morgan-Short, Thompson, & Abugaber, 2018, p. 347).
Furthermore, and in line with the main focus of the rationale that we have
described here, we identified a number of statistical concerns related to model
fit, assumption checking, measurement model validation, model justification,
measurement invariance, and hypothesis testing. To address these concerns,
we deviated from the methodology of You et al. (2016), thus making ours
a conceptual replication rather than a direct or partial replication. We also
included some exploratory elements that deviated from our own preregistration.
We emphasize that we are genuinely committed to constructing a better
understanding of L2 motivational phenomena. Our point of departure was thus
that “raising legitimate concerns about previous methodology, analysis, or con-
clusions, is regarded as a laudable line of inquiry and not to be misconstrued
as a covert assault on the original author’s integrity” (Porte, 2012, p. 3). The
fact that our results did not eventually reproduce the pattern of findings in You
et al.’s study must, in the same light, not be seen as implying any disrespect
to its authors or undermining their sizeable contribution to the field to date.
For example, we note that some of the SEM advances that we have reviewed
have only become widely available in recent years—with some emerging only
after the publication of You et al.’s study. Furthermore, You et al. used Amos
(Arbuckle, 2013), but we used Mplus (Muthén & Muthén, 1998–2012) and
R (R Core Team, 2014), both of which provided more flexibility and additional
functionality. On these grounds, we must acknowledge that it should not
be too surprising that these methodological improvements, in combination with
some differences between our design and participants and those of the initial
study (described in detail below), have led to different results and conclusions.
It should also be noted that You et al. have made all their materials publicly
accessible in the IRIS Repository (Marsden, Mackey, & Plonsky, 2016), a de-
cision that made feasible the present conceptual replication, and it is one that
is also important for supporting open science practices more generally.
Background Literature
Vision and Motivation
Within a motivational science perspective (e.g., Higgins, 2014), human thought
and action are oriented toward the future, with the understanding that individuals
possess the capacity for change. The notion that vision is an essential element
for deliberately influencing the future has become popular in various social
disciplines (e.g., van der Helm, 2009). In this broad context, van der Helm
(2009) defined vision as “the more or less explicit claim or expression of a
future that is idealised in order to mobilise present potential to move into the
direction of this future” (p. 100). In a similar spirit, Dörnyei, Henry, and Muir
(2016) stated that vision in language learning motivation is “conceptualized as
a vivid mental image of the experience of successfully accomplishing a future
goal” (p. 22). Because vision “captures a core feature of modern theories of
L2 motivation” (Dörnyei & Kubanyiova, 2014, p. 9), it has recently come to
occupy a central place in language motivation research. Hadfield and Dörnyei
(2013) have additionally explained that “[w]hen we use the word ‘vision,’ we
use it literally . . . [as] more than mere long-term goals or future plans in that
they involve tangible images and senses” (p. 2, original emphasis).
The View From Mainstream Motivational Psychology
The role played by vision of the self in future states has long been studied in
motivational psychology. An early attempt to elaborate on this role was through the
discrepancy-reducing function of future self-guides (Higgins, 1987, 1998). Ac-
cording to self-discrepancy theory, self-guides are “self-state representations
[that] are self-directive standards or acquired guides for being” (Higgins, 1987,
p. 321). Self-discrepancy theory highlights two such self-guides: the ideal self,
representing one’s own hopes and wishes, and the ought self, representing
duties and obligations expected by significant others. A similar notion is found
in possible selves theory (Markus & Nurius, 1986). Possible selves “derive
from representations of the self in the past and they include representations of
the self in the future . . . represent[ing] specific, individually significant hopes,
fears, and fantasies” (Markus & Nurius, 1986, p. 954). These visions of the
self in a future state are assumed to serve as reference points for the actual self
by harnessing a person’s hopes, aspirations, expectations, obligations, or fears
within a given domain. And, because they approximate an affective and plau-
sibly real experience of the individual in that desired or undesired future state
(Markus & Ruvolo, 1989), they may serve to energize action. According to this
view, vision derives its sustaining function from its ability to direct a person’s
behavior to approach or avoid a certain target. These promotion and prevention
functions are based on the understanding that individuals are motivated to ini-
tiate and persist in a course of action that will reduce perceived discrepancies
between their actual self and their personally relevant future self-guides and
bring one in line with the other (Higgins, 1987, 1996).
This theorizing notwithstanding, findings from a number of lines of re-
search in mainstream psychology have unveiled a less impressive role for
vision in conceptualizing and modeling motivation. One example is work in
which researchers explored the motivating function of positive thoughts and
mental images about a desired future (e.g., Oettingen, 1996, 2012; Oettingen &
Mayer, 2002). These findings highlighted unexpected twists in the explanatory
power of thinking about and imagining the future, suggesting that imagining
is a far from sufficient precursor to goal-directed action. In summarizing the
results of this line of research, Oettingen (2012) argued that “counter to what
the popular self-help literature proposes, positive thinking can be detrimental
to effort and success if it comes in the form of fantasies (free thoughts and
images about the desired future)” (p. 1). This line of research has demonstrated
that, both in immediate and delayed (some up to 2 years later) measurements
of effort and performance across multiple domains of human functioning, such
mental images have emerged as negative predictors that can actually hinder
increased effort and successful performance.
Sports psychology, the field from which vision interventions originated
decades ago and which inspired research into vision in L2 motivation
(Adolphs et al., 2018; Dörnyei et al., 2016), similarly raises questions about the
centrality of vision² in motivation and performance. In a recent meta-analysis
of the effects of mental imagery and vision interventions on biopsychological
outcomes related to both performance restoration and performance optimiza-
tion, Zach, Dobersek, Filho, Inglis, and Tenenbaum (2018) reported that the
effects of imagery interventions were all statistically nonsignificant. These re-
searchers therefore argued that “much caution” (p. 85) is needed before making
claims about the effectiveness of mental imagery.
Educational psychology has similarly received with skepticism the applica-
tion of mental imagery as an instructional tool. For example, Dunlosky, Raw-
son, Marsh, Nathan, and Willingham (2013) classified it as a technique of low
utility. In their synthesis of the evidence available for 10 common techniques
under different learning conditions and in relation to different student charac-
teristics, different levels of material difficulty, and different outcome measures,
Dunlosky et al. described the impact of visualizing as “rather limited and not
robust” (p. 24) and concluded that the evidence generated from it remains a
“patchwork of inconsistent effects” (p. 25). This is perhaps also linked to Morin
and Latham’s (2000) results, which showed a moderating effect of visualization
ability on the relationship between mental imagery and outcomes. Those more
skillful in visualization were the ones who benefited the most from it. This adds
a further layer of caveats to any effective application of visualization techniques
to classroom instruction.
Finally, research by Paluck (2010) offered a cautionary tale about the im-
portance of scrutinizing hypothesized effects experimentally (see also Hiver &
Al-Hoorie, 2020). Paluck’s large (N = 842) stratified experimental ethnographic
study aimed to promote intergroup tolerance among rival ethnic groups in the war-torn
Democratic Republic of Congo. Paluck applied an imagine-self technique
wherein participants were encouraged to take the perspective of someone from
another ethnic group and imagine themselves in that person’s shoes in the hope of
enhancing empathy. Contrary to expectation, those applying the imagine-self
technique were actually less tolerant of other groups in their questionnaire re-
sponses, in their spontaneous comments, and in objective behavior of donating
to other groups. In one added surprise to these results, those who discussed
with others what they had imagined, a strategy intended to enhance the uptake
of the treatment, showed the lowest tolerance. In trying to explain these results,
Paluck suggested that the treatment might have primed intergroup grievances
and made the participants more aware of them, which led to an effect contrary
to what had been anticipated. In combination, these results from mainstream
motivational psychology suggest that the role of vision is at best debatable and
potentially problematic.
The View From L2 Motivation
The L2 motivational self system (Dörnyei, 2005, 2009) was proposed as a theoretical
paradigm to account for motivation in language learning. Theoretically,
the L2 motivational self system draws from Markus and Nurius’s (1986) possible
selves theory, Higgins’s (1987) self-discrepancy theory, and Gardner’s (1985,
2010) socioeducational model to reframe past research strands in language
learning motivation within a self framework. Central to the L2 motivational self
system is the premise that learners’ vision of themselves in the future (specifically
the ideal L2 self and the ought-to L2 self) plays a key role in promoting
positive learning behaviors in the present (Dörnyei, 2014). The L2 motivational
self system additionally includes the L2 learning experience (sometimes called
attitudes to language learning), “which concerns situated, ‘executive’ motives
related to the immediate learning environment and experience” (Dörnyei, 2009,
p. 29).
The L2 motivational self system framework has led to a great deal of research
over the past decade (for reviews, see Al-Hoorie, 2018; Boo, Dörnyei,
& Ryan, 2015). Most of this research, however, has focused on examining the
correlations between these self-guides and self-reported intended effort (e.g.,
Csizér & Kormos, 2009; Csizér & Lukács, 2010; Islam, Lamb, & Chambers,
2013; Kormos & Csizér, 2008; Lamb, 2012; Magid, 2009; Papi & Abdollahzadeh,
2012; Ryan, 2009). Some other research has considered the correlations
between self-guides and perceptual styles (e.g., Al-Shehri, 2009; Kim,
2009; Kim & Kim, 2011, 2014; Yang & Kim, 2011). The results have generally
shown substantial correlations between self-guides and these self-report measures,
leading some researchers to describe vision as one of the most important
variables in successful learning (e.g., Dörnyei & Kubanyiova, 2014, p. 2) and as
one of the most reliable predictors of long-term intended effort (e.g., Dörnyei
et al., 2016, p. 23).
Taking this line of research further, Dörnyei and Chan (2013) examined the
relationship between different modalities (visual and auditory) and different
self-guides (ideal and ought) in two foreign languages (English and Mandarin)
in a sample of 172 Cantonese speakers in China. The researchers tested the
mental imagery function of future self-guides and reported that these were
associated with salient imagery and visualization components. In their sam-
ple, a visual perceptual style was significantly correlated with learners’ future
self-guides and, when combined with other sensory variables (the auditory
modality and imagery capacity), the correlation became stronger. In light of
these findings, they proposed that this vision is multisensory in nature, involv-
ing all modalities, not just visualization. A further important characteristic of
the imagery skills involved was their language-independent nature, pointing to
the conclusion that mental imagery is a more generic capacity rather than being
specifically L2 related. They emphasized the importance of imagery capacity
in developing future self-guides and concluded that motivation may depend on
learners’ abilities to create such mental imagery.
Although the reviewed studies mostly concerned themselves with self-
report outcome measures, other research (though still observational) has in-
vestigated the correlation between self-guides and language achievement and
success. This line of research has shown that, in contrast to self-report mea-
sures, correlations with more objectively measured language learning behavior
and achievement were less substantial. Such findings have been repeatedly
obtained in research conducted in South Korea (Kim & Kim, 2011), Indone-
sia (Lamb, 2012), Canada (MacIntyre & Serroul, 2015), Iran (Papi & Ab-
dollahzadeh, 2012), and Saudi Arabia (Moskovsky, Assulaimani, Racheva, &
Harkins, 2016). In a meta-analysis of these effects, Al-Hoorie (2018) found that
the correlation between the Ideal L2 Self scale and language achievement was
weak, r = .20, 95% CI [.08, .32]. This weak correlation exhibited substantial
heterogeneity and became even weaker after correction for potential publication
bias, r = .10, 95% CI [–.01, .22]. These findings echoed Moskovsky et al.’s
(2016) conclusion that the results “at best indicate a tenuous link between the
self guides and achievement” (p. 650).
A minority of investigators have moved beyond correlations between self-
reported measures and observational designs to the search for novel avenues
of how to motivate language learners through generating a language learning
vision and enhancing imagery (e.g., Adolphs et al., 2018; Sato & Lara, 2019)
and for superordinate and durable motivational forces that underpin long-term
persistence for L2 learning (Dörnyei et al., 2016; Ibrahim & Al-Hoorie, 2019).
In some of this research, researchers have specifically examined whether and
how classroom teachers can use imagined future states and mental projections
of the self in those states to mobilize and sustain current language learning
behavior toward that visionary target (e.g., Chan, 2014; Mackay, 2014; Magid,
2014; Magid & Chan, 2012; Sampson, 2012). This line of intervention re-
search, which is still in its infancy, has been characterized as having a number
of methodological limitations, including a lack of adequate blinding and a
tendency to emphasize qualitative components over objective criteria. When
researchers used objective criteria, the results were mixed, with Mackay (2014)
concluding that “it is not entirely evident whether any improvement in moti-
vational factors was due specifically to the development of an Ideal L2 self
or simply to the novelty of the approach” (p. 398). Furthermore, even if some
qualitative analyses have suggested that vision-based activities might increase
learners’ motivation, the subsequent link to learning effort, actual engagement
and participation, and language development and achievement remains unclear
(see Sato & Lara, 2019).
These methodological choices might help shed light on the discrepancy
among different lines of research within L2 motivation and between the appar-
ently positive results obtained in the L2 motivation field and the less positive
results obtained from mainstream psychology. Most of the L2 motivation liter-
ature to date has been observational, relying primarily on operationalizing L2
motivation through the three questionnaire-based constructs of the L2 motiva-
tional self system: ideal L2 self, ought-to L2 self, and L2 learning experience.
Without the use of experimental designs, causality among variables is hard
to establish. Nevertheless, these methodological limitations have not stopped
scholars in the field from taking conclusions to the next level. By moving be-
yond theoretical implications, as Henry and Cliffordson (2017) have recently
cautioned, “through practitioner-oriented volumes . . . [vision] has also found
its way into language classrooms” (p. 732). This has led Henry and Cliffordson
to compare the situation to a Kuhnian normal science where “central con-
cepts are adopted uncritically and anomalies ignored” (p. 732) in reference
to the argument that an understanding of vision aids teachers in increasing
their students’ motivation (Kim & Kim, 2014; Muir & Dörnyei, 2013). As an
illustration, in describing to language teachers the pedagogical implications of
vision, Dörnyei and colleagues stated:
While the day-to-day reality of one’s L2 learning experience is
determined by a myriad of situation-specific forces pulling and pushing
learners in different directions, the vision people have of the L2
speaker/user [whom] they would like to become seems, in the long run, to
be one of the most reliable predictors of long-term commitment and
effort. (Dörnyei et al., 2016, p. 42)
You et al.’s (2016) Study
One flagship study that characterizes this line of research was conducted by
You et al. (2016). Building on existing work on the L2 motivational self system
model, You et al.’s (2016) observational study aimed to investigate the contribu-
tion of vision-related variables in the setup of a whole language community—
surveying a large sample (over 10,000) of foreign language learners in China.
Their sample included two learner age groups (secondary and tertiary, mean
ages of 16.5 and 19.6 years, respectively), with female learners slightly
overrepresented (female-to-male ratio of 54:46). The researchers used 10 questionnaire scales
with a six-point Likert response format. These scales included the three L2
motivational self system components, namely, the Ideal L2 Self, the Ought-to
L2 Self, and Attitudes to L2 Learning (adapted from Taguchi, Magid, & Papi,
2009). The authors also drew from three vision-related scales: Visual Style, Auditory
Style, and Vividness of Imagery (adapted from Dörnyei & Chan, 2013;
Kim, 2009; Kim & Kim, 2011). They further developed three other scales for
the purpose of their study: Ease of Using Imagery, Positive Changes in the
Future L2 Self-Image, and Negative Changes in the Future L2 Self-Image.
Finally, following the tradition in this line of research, they used an Intended
Effort scale as the outcome variable in their study. (The full list of items is
available in Appendix S1 of the initial study.)
In their initial analyses, You et al. (2016) used a chi-square test to show
that the majority of their participants reported that they had engaged in mental
imagery in their L2 learning, with the same pattern obtained for both male and
female learners (with ratios ranging from 2.4:1 up to 5.1:1). A multivariate
ANOVA further showed that those who had reported engaging in mental im-
agery additionally reported higher motivation on the Ideal L2 Self, the Ought-to
L2 Self, the Attitudes to L2 Learning, and the Intended Effort scales. This pat-
tern was consistent across their subgroups, namely, secondary school learners,
English majors, and non-English majors.
You et al. (2016) subsequently presented an SEM analysis that, they argued,
demonstrated that the three vision-related variables (visual style, auditory style,
and vividness of imagery) are empirical antecedents of the L2 motivational
self system components (ideal L2 self, ought-to L2 self, and attitudes to L2
learning), and the latter motivational measures in turn predict learners’ self-
reported intended effort to learn the L2. The authors further maintained that
their results showed that this model operates equivalently for male and female
learners. Summarizing their results, You et al. (2016) argued that “the findings
confirmed the significance of vision in general” (p. 120).
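The ordering of variables described above can be written down as a set of directed paths. The following is a minimal sketch, assuming a fully crossed layout in which each vision-related variable predicts each L2 motivational self system component and each component predicts intended effort. The exact paths of the initial study's model are not listed in this section, so the edge list, the helper functions, and all variable names are illustrative assumptions, not the authors' specification.

```python
from itertools import product

# Hypothetical labels for the scales named in the text.
VISION = ["visual_style", "auditory_style", "vividness_of_imagery"]
L2MSS = ["ideal_l2_self", "ought_to_l2_self", "attitudes_to_l2_learning"]
OUTCOME = "intended_effort"


def build_paths(antecedents, mediators, outcome):
    """Return directed (predictor, criterion) pairs for a two-step chain.

    Assumption: every antecedent predicts every mediator, and every
    mediator predicts the outcome. The source does not list exact paths.
    """
    paths = list(product(antecedents, mediators))
    paths += [(mediator, outcome) for mediator in mediators]
    return paths


def exogenous(paths):
    """Variables that predict others but are never predicted themselves."""
    predictors = {predictor for predictor, _ in paths}
    criteria = {criterion for _, criterion in paths}
    return sorted(predictors - criteria)


paths = build_paths(VISION, L2MSS, OUTCOME)
for predictor, criterion in paths:
    # Regression-style notation: criterion ~ predictor
    print(f"{criterion} ~ {predictor}")
print("Exogenous variables:", exogenous(paths))
```

The same helper can encode a different ordering simply by changing which variables are passed as antecedents, mediators, and outcome, which makes competing structural hypotheses easy to compare side by side before any model is fitted.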
As we explained above, the primary outcome variable used in the initial
study was the Intended Effort scale. Admittedly, this Intended Effort scale—
which has been used extensively for a decade—contains items that are generic
in nature, inquiring, for example, about spending “a lot of effort” and “a
lot of time” and studying “very hard.” Generic intentions are less likely to
translate into behavior compared with more specific intentions and goals (Al-
Hoorie, 2016a, 2016b, 2018; Fishbein & Ajzen, 2010; Locke & Latham, 1990;
Teimouri, 2017). Alternatively, it has been noted that, “If we want to draw
more meaningful inferences about the impact of various motives, it is more
appropriate to use some sort of a behavioural measure as the criterion/dependent
variable” (Dörnyei & Ushioda, 2011, p. 200, original emphasis). Consequently,
if You et al. had used an alternative outcome measure such as more specific
intentions, objective language performance, or actual success and achievement,
they might have obtained a different pattern of findings.
The dominant conceptualization in the L2 motivational self system litera-
ture posits intended effort as an outcome variable. However, it may be plausible
to hypothesize that (generic) intended effort is actually an antecedent of moti-
vation. For instance, Oga-Baldwin and Nakata (2017) showed that task engage-
ment is a predictor of subsequent motivation. Extending this line of reasoning,
it seems plausible to argue that initial, generic intended effort could be a vari-
able facilitating task engagement and subsequent motivation, both eventuating
in language development. From this perspective, a generic intention of the sort
“I am prepared to expend a lot of effort in learning English” could be viewed as
a potential initial trigger for engaging in learning. The outcome variable could
then be language success, performance, or a more specific intention such as
“I try to read two short stories every week.” Few researchers have seriously
entertained this alternative hypothesis to date (see also Al-Hoorie, 2018).
Overall, the contrasting results within the L2 motivation field and between
L2 motivation and mainstream psychology in relation to the role of vision and
mental imagery have suggested that vision remains an open empirical ques-
tion warranting further investigation. We therefore decided that a preregistered
conceptual replication of the SEM model in You et al. (2016) would constitute
a systematic first step. We, thus, carefully examined You et al.’s report and
methodological choices as we prepared our preregistration protocols. Our pre-
registration protocols involved deviation from certain methodological choices
in You et al. and included the addition of an academic achievement variable in
order to shed more light on the role of vision.
Structural Equation Modeling Considerations in the Context of
You et al. (2016)
In this section, we review a number of methodological aspects in You et al.
(2016; henceforth, the initial study) in order to contextualize the design of our
conceptual replication as well as to provide a state-of-the-art account of the use
of SEM in language learning research. The initial study did not fully report
several technical points—an observation made elsewhere about a great deal of
L2 research (e.g., Al-Hoorie & Vitta, in press; Larson-Hall & Plonsky, 2015;
Marsden, Morgan-Short, Thompson, & Abugaber, 2018; Marsden, Thompson,
& Plonsky, 2018)—making it impossible for us to offer a complete evaluation
of the validity of the results, which thus emphasized the need for this repli-
cation. Following Marsden, Morgan-Short, Thompson, and Abugaber’s (2018)
recommendation to spell out deviations from a replicated study as fully as
possible, we have surveyed these methodological issues from the initial study
in order to point out potential departures in our methodology from that of the
initial study.
General Design Issues
Stratified Sampling
To begin with, in the initial study, the authors described their method as stratified
sampling, proposing that “the large stratified sample lends credibility to the
results” (p. 119). They also explained:
In selecting our participants, a stratified sampling method was followed,
and while our limited recourses did not allow for fully random or
systematic sampling within each stratum of the sampling frame, it is
believed that the robust coverage ensures that no major motivational
trends have gone unnoticed. (You et al., 2016, p. 102)
The ultimate goal of a stratified sampling procedure is to obtain an accurate
estimation of population parameters. This requires researchers to weight the
sample size proportionally based on the size of each stratum. For example,
if Region A is considerably larger than Region B, then the sampling should
take that fact into account. Not doing so may lead to biased standard errors,
especially when the means are different (Levy & Lemeshow, 2008; Lumley,
2010). Although in the initial study, You et al. (2016) did not report consulting
census data to formally determine the size of each stratum in their Chinese
population, we consulted official census data to guide our sampling decisions.
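The proportional weighting described above can be sketched as follows. This is a minimal illustration only: the function name, the region labels, and the population counts are hypothetical and are not taken from either study.

```python
def proportional_allocation(stratum_sizes, total_n):
    """Split total_n across strata in proportion to each stratum's population size."""
    population = sum(stratum_sizes.values())
    # Ideal (fractional) share of the sample for each stratum.
    exact = {k: total_n * v / population for k, v in stratum_sizes.items()}
    # Round down first, then give leftover units to the largest remainders.
    alloc = {k: int(x) for k, x in exact.items()}
    leftover = total_n - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc


# Hypothetical population counts for two regions (illustration only).
strata = {"Region A": 600_000, "Region B": 400_000}
print(proportional_allocation(strata, 1000))  # {'Region A': 600, 'Region B': 400}
```

The largest-remainder step simply guarantees that the stratum sample sizes add up to the planned total.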
Model Fit
As measures of the robustness of findings that were central to the initial study,
the model fit indices showed some evidence of misfit. For example, the initial
study reported some comparative fit index (CFI), parsimonious CFI, normed fit
index, and nonnormed fit index—Tucker-Lewis index (TLI)—values that were
less than .90 (see Figures 1 and 2 in the initial study). The authors also reported
a χ²/df ratio of over 25 (see Figure 1 in the initial study), much higher than
the recommended 3.0 or 5.0 threshold (Wheaton, Muthén, Alwin, & Summers,
1977). Although use of the χ²/df ratio as an index of fit has recently been criticized
(Goodboy & Kline, 2017; Kline, 2016), this extremely high value coupled
with the other goodness-of-fit indices below .90 raised concerns that the model
might have been misspecified and, thus, might be challenging to reproduce.
This was especially the case given that the χ²/df ratio is affected by sample
size only when the model is misspecified (Marsh, Balla, & McDonald, 1988)
because it penalizes for excessive model complexity (West, Taylor, & Wu,
2012). Following recent practice (e.g., Kline, 2016), in our study, we therefore
have reported χ², df, and p, along with two incremental fit indices (CFI and TLI) and the
root mean square error of approximation (an absolute fit index). Furthermore,
because these model fit indices are global, their low values in the initial study
might have led the reader to wonder, additionally, about more local fit indices
(see below) that were not reported.
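For readers who wish to see how these global indices relate to the χ² statistic, the following is a minimal sketch using commonly given textbook definitions. The function name and all input values are hypothetical; SEM packages may differ in details (e.g., using N rather than N − 1 in the RMSEA), so package output should take precedence.

```python
import math


def fit_indices(chi2_m, df_m, chi2_b, df_b, n):
    """Compute chi2/df, CFI, TLI, and RMSEA from model (m) and baseline (b) chi-squares."""
    d_m = max(chi2_m - df_m, 0.0)  # estimated misfit of the tested model
    d_b = max(chi2_b - df_b, 0.0)  # estimated misfit of the baseline (null) model
    cfi = 1.0 - d_m / max(d_m, d_b) if max(d_m, d_b) > 0 else 1.0
    tli = ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)
    rmsea = math.sqrt(d_m / (df_m * (n - 1)))  # assumption: N - 1 in the denominator
    return {"chi2_df": chi2_m / df_m, "CFI": cfi, "TLI": tli, "RMSEA": rmsea}


# Hypothetical values, not taken from the initial study or from our data.
for name, value in fit_indices(250.0, 100, 2100.0, 120, 501).items():
    print(f"{name}: {value:.3f}")
```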
Measurement Model
Convergent and Discriminant Validity
The initial study drew from several scales, some of which had not been used
previously. The initial study did not report details about the psychometric prop-
erties of these scales apart from basic Cronbach’s alpha as an estimate of scale
reliability. A prerequisite step when researchers use SEM is investigating the
measurement model in order to satisfy certain psychometric conditions before
they conduct a structural model (e.g., Brown, 2015). The authors of the initial
study acknowledged that they had not empirically assessed the measurement
model, stating that “measurement models were drawn up based on the theoret-
ical considerations outlined in the review of the literature” (pp. 105–106).
The measurement model involves conducting a confirmatory factor analysis
with the aim of establishing the extent of construct validity of the latent vari-
ables in the model. This involves investigating both convergent and discriminant
validity (Fornell & Larcker, 1981; Hair, Black, Babin, & Anderson, 2010)—
both crucial for replication attempts. Convergent validity concerns whether the
indicators satisfactorily represent their latent constructs. One indicator of con-
vergent validity is factor loadings. A rule of thumb for satisfactory convergent
validity suggests that the standardized factor loading of each indicator variable
should be .70 or higher (or at least .50 or higher; Hair et al., 2010). The initial
study reported factor loadings as low as .31 and two at .40, with several others
missing (see Figure S3 in the appendix of the initial study) making it difficult
to assess the convergent validity of the latent variables in the model.
Another indicator of convergent validity is construct reliability (also called
composite reliability). In the context of latent variables, construct reliability is
best calculated using methods other than Cronbach’s alpha because it requires
the items to be τ-equivalent (i.e., with equal factor loadings; Raykov, 2004),
an assumption that is rarely satisfied in SEM data. Construct reliability is
computed using Jöreskog’s rho formula,³ displayed as Equation 1 (Fornell &
Larcker, 1981), where the reliability of the latent variable η is a function of
(varies with) the squared sum of the standardized factor loadings (λ) and the
sum of the error variances (ε):
$$\rho_{\eta} = \frac{\left(\sum_{i=1}^{p} \lambda_{y_i}\right)^{2}}{\left(\sum_{i=1}^{p} \lambda_{y_i}\right)^{2} + \sum_{i=1}^{p} \operatorname{Var}(\varepsilon_i)} \tag{1}$$
SEM software packages do not automatically compute construct reliability,
so researchers must calculate it separately from the confirmatory factor analysis.
The rule of thumb for construct reliability is that it should ideally be .70 or
higher (Hair et al., 2010). The initial study reported only the Cronbach’s alpha of
its scales, which by itself might have reflected incomplete information about the
reliability of the latent variables. (For more about the controversy surrounding
bias in Cronbach’s alpha in the context of SEM, see Peterson & Kim, 2013,
and Raykov, 1998; for details on the controversy surrounding Cronbach’s alpha
more generally, see McNeish, 2018, Raykov & Marcoulides, 2019, and Sijtsma, 2009.)
A further indicator of convergent validity is the average variance extracted
(AVE). The AVE aims to establish whether the variance captured by the latent
variable is larger than the variance due to measurement error. The AVE can be
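computed using Equation 2 (Fornell & Larcker, 1981), where the AVE (ρvc) of
the latent variable η is a function of the sum of the squared standardized factor
loadings (λ) and the sum of the error variances (ε):
$$\rho_{vc} = \frac{\sum_{i=1}^{p} \lambda_{y_i}^{2}}{\sum_{i=1}^{p} \lambda_{y_i}^{2} + \sum_{i=1}^{p} \operatorname{Var}(\varepsilon_i)} \tag{2}$$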
Again, SEM software packages do not compute AVE. Researchers must cal-
culate it for each latent variable from the confirmatory factor analysis output.
As a rule of thumb, the AVE should be .50 or higher (Fornell & Larcker, 1981).
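Equations 1 and 2 can be computed by hand from standardized confirmatory factor analysis output. The sketch below assumes standardized loadings, so that each error variance equals 1 − λ²; the function names and the loadings are hypothetical.

```python
def composite_reliability(loadings):
    """Equation 1: squared sum of loadings over itself plus summed error variances."""
    sum_loadings = sum(loadings)
    error_variance = sum(1 - lam ** 2 for lam in loadings)  # assumes standardized loadings
    return sum_loadings ** 2 / (sum_loadings ** 2 + error_variance)


def average_variance_extracted(loadings):
    """Equation 2: sum of squared loadings over itself plus summed error variances."""
    sum_squared = sum(lam ** 2 for lam in loadings)
    error_variance = sum(1 - lam ** 2 for lam in loadings)  # assumes standardized loadings
    return sum_squared / (sum_squared + error_variance)


# Hypothetical standardized loadings for a four-item scale.
loadings = [0.80, 0.70, 0.60, 0.50]
print(round(composite_reliability(loadings), 3))       # about 0.749
print(round(average_variance_extracted(loadings), 3))  # about 0.435
```

In this hypothetical case the scale would pass the .70 rule of thumb for construct reliability while falling short of the .50 rule of thumb for the AVE, which illustrates why the two indices are worth reporting together.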
Discriminant validity refers to whether the constructs are sufficiently dis-
tinct from each other. The recommended measure is that the AVE values should
be greater than their respective interconstruct correlations squared (Hair et al.,
2010). The rationale behind this rule of thumb is that the construct should
explain more of the variance of its items than it shares with other constructs.
Because the initial study did not report AVE values for its latent variables,
discriminant validity could not be assessed. This information is helpful partic-
ularly when there is substantial overlap in the items used for different scales.
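The comparison of AVE values with squared interconstruct correlations can be sketched as a simple check. The construct labels, AVE values, and correlations below are hypothetical placeholders, not estimates from either study.

```python
def discriminant_validity_problems(ave, correlations):
    """List construct pairs whose squared correlation is not smaller than both AVEs."""
    problems = []
    for (a, b), r in correlations.items():
        shared_variance = r ** 2
        # The rule of thumb requires each construct's AVE to exceed the shared variance.
        if shared_variance >= min(ave[a], ave[b]):
            problems.append((a, b, round(shared_variance, 3)))
    return problems


# Hypothetical AVE values and latent correlations for three constructs.
ave = {"F1": 0.55, "F2": 0.52, "F3": 0.60}
correlations = {("F1", "F2"): 0.81, ("F1", "F3"): 0.60, ("F2", "F3"): 0.45}
print(discriminant_validity_problems(ave, correlations))  # [('F1', 'F2', 0.656)]
```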
Table 1 Overlap in item wording of different scales used in the initial study

| Item | Scale |
| --- | --- |
| I can imagine myself in the future giving an English speech successfully to the public in the | Ideal L2 Self |
| If I wish, I can imagine how I could successfully use English in the future so vividly that the images and/or sounds hold my attention as a good movie or story does. | Vividness of Imagery |
| It is easy for me to imagine how I could successfully use English in the future. | Ease of Using Imagery |
| In the past I couldn’t imagine of myself using English in the future, but now I do imagine it. | Positive Changes of the Future L2 Self-Image |

In the case of the initial study, as Table 1 illustrates, items belonging to dif-
ferent scales had considerable wording overlap. Also in the SEM model of
the initial study, Vividness of Imagery predicted the Ideal L2 Self at .81 (see
Figure S3 in the initial study). This high coefficient raised further questions
about whether these were indeed two distinct latent variables. It was therefore
plausible that the items in Table 1 might turn out to be manifestations of the
same latent variable rather than to represent different latent variables. To ascer-
tain this, we investigated the convergent and discriminant validity of our model
using the methods described in this section before moving to the structural model.
Checking Assumptions
As with many statistical procedures, a number of assumptions need to be
satisfied before SEM results can be considered valid. Which particular as-
sumptions are required to be met depends on the details of the model. For
example, use of maximum likelihood estimation assumes that the data are
continuous and multivariate normal. In the initial study, the authors did not
make explicit which estimation method they had used, whether their ordinal
data were multivariate normal, or whether they had inspected outliers and
had dealt with them. When these assumptions are violated, alternative estima-
tion methods become more appropriate, including robust maximum likelihood
and diagonally weighted least squares (Li, 2016). In our study, we used the
diagonally weighted least squares estimation method because our data were ordinal.
Another assumption is technically known as uncorrelated errors. Many
statistical procedures assume that errors are uncorrelated, and, if they are corre-
lated, special adjustments to the model need to be made. Errors can be correlated
when participants are sampled from discrete units such as classes, schools, and
regions. In such cases, some students will be more similar to each other (due
to their shared environment) than they would be if they are randomly sampled
from the population. The degree of dependence can be estimated empirically
using Equation 3 to calculate the intraclass correlation (ICC; Hox, 2010), where
the ICC is a function of the variance of the highest-level errors ($\sigma^2_{u0}$) and the
lowest-level errors ($\sigma^2_{e}$):
$$\rho = \frac{\sigma^2_{u0}}{\sigma^2_{u0} + \sigma^2_{e}} \tag{3}$$
The ICC can be obtained from software packages with multilevel functionality.
As an illustration of the seriousness of correlated errors, even a small ICC of
.01 can inflate Type I error rate from .05 to .17 in a sample of 100; an ICC
of .05 would elevate it to .43 (Barcikowski, 1981; Kreft & de Leeuw, 1998).
The degree of bias resulting from correlated errors can be estimated using the
design effect formula (Muthén & Satorra, 1995) of Equation 4, where c is the
average cluster size:
$$\mathit{Deff} = 1 + (c - 1)\rho \tag{4}$$
Depending on the study design, the cluster might be the class, the school,
the neighborhood, or any other shared environment. Software packages do not
usually calculate design effect, so researchers need to calculate it themselves.
As a rule of thumb, a design effect of less than 2.0 is considered tolerable,
though some conditions require a more conservative threshold (Lai & Kwok,
2015). The authors of the initial study did not report the ICC or design effect
for their data, though the extremely large-scale nature of that study made it
likely that correlated errors existed. In our study, we adjusted standard errors
to correct for clustering within classrooms in our sample.
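Because the design effect usually has to be calculated by hand, a minimal sketch of the ICC and design effect calculations may be useful. The function names, variance components, and cluster size below are hypothetical.

```python
def intraclass_correlation(var_between, var_within):
    """ICC: share of total variance located at the cluster (highest) level."""
    return var_between / (var_between + var_within)


def design_effect(avg_cluster_size, icc):
    """Design effect: 1 + (c - 1) * rho, with c the average cluster size."""
    return 1 + (avg_cluster_size - 1) * icc


# Hypothetical variance components from a two-level (students in classes) model.
icc = intraclass_correlation(var_between=0.05, var_within=0.95)
deff = design_effect(avg_cluster_size=30, icc=icc)
print(round(icc, 3), round(deff, 2))  # 0.05 2.45
```

In this hypothetical case, an ICC of only .05 with classes of 30 students already yields a design effect above the 2.0 rule of thumb.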
Local Fit
As we mentioned above, the model fit indices reported in the initial study were
measures of global fit, rather than local fit. As the name suggests, local fit can
point to specific problematic areas in the model. Although commonly used,
global measures have been criticized for not providing an adequate indication
of the size or exact location of misspecification or the lack of fit (Saris, Satorra,
& van der Veld, 2009), and this concern constituted part of the rationale for
our conceptual replication. A common approach to the evaluation of local fit
is the inspection of residuals. Residuals represent the discrepancy between the
hypothesized and observed covariance matrices. Residuals can be obtained for
every unique value in the model, thus allowing inspection of local misfit. As
Goodboy and Kline (2017) assert, “The real details about model fit can be found
by inspecting the residuals” (p. 74), whereas “testing for global fitness is often
of only minor use” (Pearl, 2009, p. 145). The larger the standardized residual
(values beyond ±2.0 are usually flagged), the worse the fit. For some estimation methods (e.g., diagonally
weighted least squares), SEM packages provide the normalized residuals instead,
which represent ratios of covariance residuals over the standard error of the
sample covariance and which are a more conservative measure of local fit
(Kline, 2016). In view of these concerns, we inspected local fit using normalized
residuals in our study.
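The logic of residual inspection can be sketched with plain correlation residuals, that is, observed minus model-implied correlations. This is a simplification: standardized and normalized residuals additionally divide by a standard error, which SEM software supplies. The one-factor loadings, the observed correlations, and the .10 flagging threshold are hypothetical illustration values.

```python
def implied_correlations(loadings):
    """Model-implied correlation matrix for a one-factor model with standardized loadings."""
    p = len(loadings)
    return [[1.0 if i == j else loadings[i] * loadings[j] for j in range(p)]
            for i in range(p)]


def flag_residuals(observed, implied, threshold=0.10):
    """Return (row, column, residual) for each unique pair exceeding the threshold."""
    flagged = []
    for i in range(len(observed)):
        for j in range(i):  # lower triangle: each unique pair once
            residual = observed[i][j] - implied[i][j]
            if abs(residual) > threshold:
                flagged.append((i, j, round(residual, 3)))
    return flagged


# Hypothetical observed correlations among three items.
observed = [[1.00, 0.58, 0.46],
            [0.58, 1.00, 0.60],
            [0.46, 0.60, 1.00]]
implied = implied_correlations([0.8, 0.7, 0.6])
print(flag_residuals(observed, implied))  # [(2, 1, 0.18)]
```

Here the global picture could look acceptable while the residual for the second and third items points to a specific pair the model underpredicts.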
Structural Model
Model Justification
The primary purpose of using SEM is to test and establish causal claims (Pearl,
2012). As with similar statistical analyses, SEM results are only as good as the
assumptions researchers hold about causality among the variables. In order for
SEM to be valid, these causal assumptions should be derived from experimental,
logical, and temporal considerations. Bollen and Pearl explain:
[D]evelopers and users of SEMs are under the mistaken impression that
SEMs can convert associations and partial associations among observed
and/or latent variables into causal relations. The mistaken suggestion is
that researchers developing or using SEMs believe that if a model is
estimated and it shows a significant coefficient, then that is sufficient to
conclude that a significant causal influence exists between the two
variables. Alternatively, a nonsignificant coefficient is sufficient to
establish the lack of a causal relation. Only the association of observed
variables is required to accomplish this miracle. (Bollen & Pearl, 2013,
p. 308)
In the initial study, the authors did not fully lay out the theoretical rationale
behind the model, which involved 10 structural paths, further justifying the
need for our conceptual replication, which allowed us to explore this and other
plausible models. For example, it was not made explicit why it was theorized
that the relationship between the auditory style and vividness of imagery should
be causal, why both the ideal L2 self and the ought-to L2 self had a causal effect
on attitudes to L2 learning but not the other way around, or why both auditory
and visual styles had a direct effect on attitudes to L2 learning but vividness of
imagery did not.
In the initial study, the authors did, however, discuss the rationale of some
paths, though they acknowledged the controversy surrounding the direction
of these paths: “Because the L2 Motivational Self System was originally pro-
posed as a framework with no directional links among the three components,
past empirical studies employing structural equation modeling (SEM) have not
been uniform in specifying these interrelationships” (p. 97). In fact, past SEM
research has at times even been contradictory. For example, in a chapter by
Taguchi et al. (2009, p. 86) the authors hypothesized a SEM path with the ideal
L2 self → attitudes to learning English; however, in a following chapter in the
same anthology, Kormos and Csizér (2009, p. 100) hypothesized the opposite
SEM path of the L2 learning experience → the ideal L2 self. This is puzzling
given that attitudes to learning English and the L2 learning experience, despite
what these expressions might imply, have actually been used synonymously
in the language motivation literature. In fact, in the initial study, You et al.
acknowledged that the two scale names were “largely terminological variation
because the specific questionnaire items that were used to tap into this compo-
nent were broadly similar across the studies” (pp. 96–97). Both studies reported
support for their respective model, though the rationale for the path direction
each adopted was not clear. One could argue that the two variables reinforce
each other, and therefore there is reciprocal causality between them (or, in tech-
nical terms, nonrecursive SEM paths). Although this third hypothesis is also
plausible, to our knowledge it has not been tested empirically or even theorized.
A further justification for our conceptual replication was that the initial study
did not set out to test equivalent or competing models to examine how well they
account for the data. This problem has been highlighted by SEM methodolo-
gists, who have considered it a form of confirmation bias. Researchers select
their preferred model to confirm it while they overlook alternative models
that could potentially account for the data equally well. Although researchers
have typically overlooked them, “equivalent models exist for most published
applications, often in large numbers” (MacCallum, Wegener, Uchino, & Fab-
rigar, 1993, p. 196). Therefore, replication research should explore alternative
models to minimize confirmation bias, which is seen as a serious threat to
published SEM research (Kline, 2016, p. 296; see also Robles, 1996, and Shah
& Goldstein, 2006, for similar arguments).
Figure 1 An illustration of the mindlessness of structural equation modeling. Two
different models resulting in identical structural coefficients and identical model fit.
Error terms have been removed for simplicity. CFI = comparative fit index; TLI =
Tucker-Lewis index; RMSEA = root mean square error of approximation; PCLOSE =
p of Close Fit.
Figure 1 illustrates more concretely the danger of overlooking alternative
conceptualizations of one’s hypothesized model. The figure presents two mod-
els, each with its structural coefficients and model fit (adapted from Albalawi,
2018). Panel A represents the hypothesis that experiences of disappointment in
daily life have a negative causal effect on learners’ ideal L2 self (note the head
of the arrow pointing from disappointment to ideal L2 self). Panel B, however,
represents the exact opposite hypothesis: Having a high (perhaps unrealistic)
ideal L2 self could lead learners to disappointment. Despite the contradic-
tory nature of the two models, they have identical structural coefficients and
identical model fit. Thus, regardless of whether the researcher advocates the
model in Panel A or Panel B of Figure 1, the SEM results would support the
researcher’s hypothesis equally. This example demonstrates that each and every
path in the SEM model needs to have a convincing (ideally experimental) ratio-
nale derived from prior research (Joe, Hiver, & Al-Hoorie, 2017; Yun, Hiver,
& Al-Hoorie, 2018). Otherwise, SEM software would mindlessly crunch the
numbers and return support for the model.
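The simplest case of this problem can be reproduced with two variables. In the sketch below, which uses simulated (hypothetical) data rather than the data behind Figure 1, the standardized coefficient for x predicting y is identical to the one for y predicting x, so the estimate alone cannot tell the two causal stories apart.

```python
import random
import statistics

random.seed(0)
# Simulated data: the generating direction is x -> y, but the analysis cannot know that.
x = [random.gauss(0, 1) for _ in range(500)]
y = [-0.4 * xi + random.gauss(0, 1) for xi in x]


def standardized_slope(predictor, outcome):
    """Least-squares slope after converting both variables to z scores."""
    mean_p, mean_o = statistics.fmean(predictor), statistics.fmean(outcome)
    sd_p, sd_o = statistics.pstdev(predictor), statistics.pstdev(outcome)
    z_p = [(v - mean_p) / sd_p for v in predictor]
    z_o = [(v - mean_o) / sd_o for v in outcome]
    return sum(a * b for a, b in zip(z_p, z_o)) / sum(a * a for a in z_p)


b_yx = standardized_slope(x, y)  # model "x causes y"
b_xy = standardized_slope(y, x)  # model "y causes x"
print(abs(b_yx - b_xy) < 1e-9)   # True: both directions yield the same coefficient
```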
Finally, the initial study acknowledged that it had been a routine practice
in SEM studies to drop nonsignificant paths: “[W]hen certain links between
the ought-to self and other components did not reach significance, they were
deleted from the SEMs” (p. 98). However, SEM methodologists have cautioned
that “deleting nonsignificant paths from a structural equation model is a terrible
way to trim the model” (Goodboy & Kline, 2017, p. 72, original emphasis).
In addition to capitalizing on chance, this procedure wastes an important fea-
ture of SEM over conventional significance testing: estimating the magnitude
of relationships. Deleting a priori hypothesized paths due to nonsignificance
defeats this purpose. The initial study was not explicit as to whether the re-
searchers had dropped any nonsignificant paths or to whether they had made
any post hoc modifications (and if so, what these modifications were). When
researchers perform modifications, SEM results become exploratory. Neither
was the initial study clear as to its purpose. At first, the authors stated that
the aim of the study was “to explore the nature of this motivational role [of
vision]” (p. 107, emphasis added), but later stated in the conclusion that “the
findings confirmed the significance of vision” (p. 120, emphasis added). In our
conceptual replication, we adhered to our preregistration protocols and have
explicitly reported any deviation from these and from the initial study.
Measurement Invariance
Measurement invariance is concerned with whether two (or more) participant
groups interpret the items in a conceptually similar manner. For example,
when comparing males and females, an observed difference might be due to
one group indeed having a higher latent score, but it might also be due to
the two groups simply understanding the items differently. That the groups
understand the items in a similar way is considered “a logical prerequisite”
(Vandenberg & Lance, 2000, p. 9) to any meaningful interpretation of this
difference. Differential interpretation of items could occur in cross-cultural
and cross-age comparisons. Different levels of measurement invariance may
be established, most commonly configural (same number of factors across
groups), metric (or weak; equal factor loadings as a prerequisite for structural
coefficient comparisons), and scalar (or strong; equal item intercepts as a pre-
requisite for latent mean comparisons). When measurement invariance does not
hold, the measure is considered problematic and in need of further refinement
(e.g., see Davidov, Meuleman, Cieciuch, Schmidt, & Billiet, 2014). The same
is true in cross-time comparisons because the understanding of some abstract
notions may change over time. The initial study reported group differences
based on gender and use of visualization. However, because it did not report
measurement invariance results, it is not clear whether these differences were
genuine or an artifact of lack of measurement invariance. In our study, there-
fore, we ensured the satisfaction of measurement invariance before conducting
analyses of group differences. (For readers unfamiliar with the concept of mea-
surement invariance, detailed introductions such as Brown, 2015, chapter 7,
and Steinmetz, Schmidt, Tina-Booh, Wieczorek, & Schwartz, 2009, may be helpful.)
Hypothesis Testing
To contextualize the design of our conceptual replication, a final analytical
concern relates to comparing coefficients for different groups without exam-
ining whether the differences between these coefficients are significant. Visual
comparisons lend inconclusive support if not backed by inferential testing. For
example, although the initial study stated, “we do find some discrepancies in the
scores of the two genders” (p. 111) and “the coefficient increased, and it peaked
at .31 in the most committed subsample” (p. 117), readers would also want
to know whether these differences were significant or not. In fact, even if one
coefficient is significant and the other is not, this may not necessarily imply that
the difference between them is itself significant (see Gelman & Stern, 2006).
Some procedures have been developed for this purpose in the context of SEM,
such as the Wald test (see Brown, 2015; Kline, 2016). Unfortunately, general-
izing without statistical hypothesis testing is a prevalent statistical problem in
the L2 field (Al-Hoorie, 2018; Al-Hoorie & Vitta, in press). In our study, we
used the Wald test to test hypotheses about group differences.
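The point that a significant and a nonsignificant coefficient need not differ significantly from each other can be illustrated with a simplified Wald-type test. This sketch assumes two independent groups and ignores any covariance between the estimates; in a multiple-group SEM, the software computes the test from the full parameter covariance matrix. The function name is ours, and the input values (25 and 10, each with a standard error of 10) are purely illustrative.

```python
import math


def wald_difference_test(b1, se1, b2, se2):
    """Test H0: b1 == b2 for estimates from two independent groups."""
    diff = b1 - b2
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)  # assumes independent estimates
    z = diff / se_diff
    p = math.erfc(abs(z) / math.sqrt(2))      # two-tailed p from the normal distribution
    return {"z": z, "wald_chi2": z ** 2, "p": p}


# Group 1: 25 (SE 10) is significant on its own; Group 2: 10 (SE 10) is not.
result = wald_difference_test(25, 10, 10, 10)
print(round(result["z"], 2), round(result["p"], 3))  # 1.06 0.289
```

Although only the first coefficient differs significantly from zero, the difference between the two coefficients is itself far from significant.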
The Present Study
Having considered both open empirical questions and various methodological
issues relating to the initial study, we felt that a conceptual replication was
warranted to shed more light on the role of vision in language learning moti-
vation. In addition to these substantive and methodological rationales, another
criterion warranting replication was the weight a study has had on a field (e.g.,
Lindsay, 2015; Marsden, Morgan-Short, Thompson, & Abugaber, 2018). The
initial study has received numerous citations, as we described above, which
suggested that it has continued to have a major impact on the landscape of our field.
We have described our replication as conceptual (or constructive) due to
the deviation of our SEM model from that of the initial study. Conceptual
replications “introduce more than one significant change to the initial study
and can extend agendas in multifaceted ways but are in a weaker position
for ascribing different findings to the adaptations made to the initial study”
(Marsden, Morgan-Short, Thompson, & Abugaber, 2018, p. 366). Because our
replication was conceptual and because conceptual replications are in a weaker
position for explaining results that are not in line with those of an initial study,
we have made no claim that our results are more valid than those of the initial study.
Along the same lines, we reiterate that our intention was not to direct
criticism toward particular lines of research, methodological traditions, or in-
dividual scholars but rather to serve as a constructive step in pushing the L2
motivation field forward through ever more refined and precise empirical in-
sights (see Porte, 2012). The misconception that replicators may be harassing,
or even bullying, the authors of an initial study (see Bohannon, 2014) may be
attributed to the fact that the language field has not yet fostered a replication
culture. As an illustration, Marsden, Morgan-Short, Thompson, and Abugaber
(2018) estimated that there have been as few as one published replication study
for every 400 journal articles in the L2 field—an estimate they still described
as “generous” (p. 344). Within the language motivation field more specifically,
the situation is acute. Inspection of Marsden, Morgan-Short, Thompson, and
Abugaber’s (2018) list of replication studies across 26 L2 journals revealed
only one self-labeled replication study on motivation (Mantle-Bromley, 1995).
Although there were likely more replication studies that were not self-labeled as
such in their titles or abstracts, lack of explicit labeling can be counterproduc-
tive. For example, authors and reviewers may not feel as compelled to scrutinize
interstudy variation and to attempt to explain or evaluate it, and consequently
“heterogeneity from one study to the next can pass largely unchecked” (Mars-
den, Morgan-Short, Thompson, & Abugaber, 2018, p. 365).
As a methodological safeguard, we preregistered our study prior to data
collection (a time-stamped copy is available through the Open Science Framework). Preregistration
involves specifying in advance the research questions, the detailed
study design, as well as the analysis plan and statistical model. This aims to de-
marcate exploratory versus confirmatory research and to minimize researcher
degrees of freedom, which can bias results in favor of preferred or anticipated outcomes.
As we stated in our preregistration protocols, the primary purpose of this
study was to replicate You et al.’s (2016) SEM model (see their Figure S3). We
adhered to the basic design and procedures described in You et al. (2016). At
the same time, our preregistration protocols explicitly stated that we planned
to deviate from You et al.’s design in several aspects:
1. Due to the considerable overlap in the wording of items belonging to differ-
ent scales, we explicitly predicted in our preregistration protocols that some
scales might not show sufficient discriminant validity and would therefore
have to be combined or completely excluded. (As we have shown below,
this prediction was confirmed.) Since we were unable to anticipate which
scales would survive the measurement model—because this was an empir-
ical question—we expected our final model to be different from that of You
et al., rendering our findings exploratory (again we had anticipated this in
our preregistration protocols).
2. We also stated in our preregistration protocols that we would compare the
fit of competing models. Once more, as we were unable to anticipate which
scales would survive the measurement model, we acknowledged that this
aspect would also be exploratory.
3. As the researchers had done in the initial study, we planned to conduct
multiple-group analyses to compare gender, vision capacity (vision-yes vs.
vision-no), and vision change (change-positive vs. change-negative). How-
ever, because the scales measuring the latter two conditions (vision capacity
and vision change) did not survive the measurement model, this part of the
analysis could not be conducted.
4. We planned to conduct our conceptual replication in a neighboring country,
South Korea, in which we had convenient access to a comparable sample
of learners. We reasoned, too, that the findings of the initial study would
be of greater utility to the field if they were shown to generalize first to a
neighboring country and then to other, more dissimilar, parts of the world
in future studies.
5. In an attempt to broaden the scope of our investigation, we sought to include
a second outcome measure in addition to intended effort. We obtained both
midterm and final course grades in English from our participants. Given the
importance of this outcome variable, we considered this point an extension
to the design of the initial study (rather than a deviation from it).
It is in light of these departures from the design of the initial study that we
have described our study as a conceptual rather than a direct or partial replication
(see Marsden, Morgan-Short, Thompson, & Abugaber, 2018). Despite these
design changes—including the additional outcome variable—our ultimate aim
in this study remained unchanged: to evaluate the claim advanced in the initial
study that vision is one of the single most important variables in language
motivation (see also Dörnyei & Kubanyiova, 2014, p. 2). Preregistering our
design and analyses was intended to give an additional assurance of the validity
of our conceptual replication.
Our sample included 1,297 secondary school L2 learners (female = 789) recruited
from middle schools (12–14 years old). After consulting national census
data from the Korean Statistical Information Service, the government office
within the Ministry of Strategy and Finance that collects population data yearly
(Korean Statistical Information Service, 2016), we endeavored to stratify our
sample proportionately from the two main geographic/administrative regions
(northern and southern) to represent their socioeconomic characteristics. We
targeted public middle school students (1.1 million total), gender distribution
(52% female, 48% male), and number of public middle schools in the respective
administrative regions (2,567 total; weighted 42.8% and 57.2%). We sampled
from the more densely populated and urban northern provinces including the
capital (n = 556), and from all the remaining southern provinces (n = 741) to
correspond with the sociogeographic census weightings. The response rate for
female respondents was slightly higher in our sample, pushing the final gender
ratio of our sample closer to 59:41.
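As a quick arithmetic check on the stratification described above, the achieved regional shares can be compared with the census-based weights. This is only a minimal sketch: the variable and function names are illustrative and are not taken from the study materials.

```python
# Regional sample sizes and census-based target weights reported in the text.
sample_counts = {"northern": 556, "southern": 741}
target_weights = {"northern": 0.428, "southern": 0.572}


def achieved_shares(counts):
    """Return each stratum's share of the total sample."""
    total = sum(counts.values())
    return {stratum: n / total for stratum, n in counts.items()}


shares = achieved_shares(sample_counts)
for stratum, share in shares.items():
    gap = share - target_weights[stratum]
    print(f"{stratum}: achieved {share:.1%}, "
          f"target {target_weights[stratum]:.1%}, gap {gap:+.1%}")
```

The achieved shares differ from the census weights by roughly a tenth of a percentage point, which is consistent with the proportional stratification the text describes.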
When compared to the initial study, our sample was younger because the
age range of the initial study was 16–20 years. The gender ratio of the initial
study also featured more females both at the university level (62:38) and at the
secondary school level (53:47). In contrast, several key structural contingencies
made the classroom learning of English of equal instrumental utility in both
contexts (i.e., South Korea and China) at both the secondary and tertiary levels.
Across both settings, there was relative parity in the social value ascribed to
achievement in L2 English, an occurrence influenced by the outsized emphasis it
is given on standardized assessments in compulsory education and its perceived
importance in serving as a broad metric of academic success. In practice, then,
L2 English learning in both settings (South Korea and China) serves as a
social stratification metric that is key to success at different stages of life and
in different areas of society, and poor scores in English language learning in
secondary or tertiary education can relegate individuals to lower-tier learning
institutions and to types of employment perceived as less desirable (Muslimin,
The initial study used a total of 10 scales relating to several aspects of motivation
and vision (see Table 2). We adopted these scales using a 7-point Likert
response format⁴ (see Appendix S1 in the Supporting Information online for
the complete list of items). Table 2 presents the Cronbach's alpha reliability
estimates from our Korean sample and shows how they compare to those of
the initial study's Chinese sample. The reliabilities were broadly similar, with
the largest difference in Negative Changes of the Future L2 Self-Image, which
contained only two items. This was likely because Cronbach's alpha, in addition
to the criticism that it has received in general terms as we reviewed above, is
particularly inappropriate with two-item scales (Eisinga, Grotenhuis, & Pelzer,
2013). We have reported Cronbach's alpha for this two-item scale here only for
the sake of comparison with the initial study. Following the recommendation
of Eisinga et al. (2013), we computed the Spearman-Brown coefficient, which
turned out to be much lower (ρ = .45), indicating a critical problem with
reliability for this construct. The measurement model (see below) shed more light
on the psychometric properties of these scales.

Table 2 Cronbach's alpha reliability estimates for the 10 scales used in the present study
and in the initial study

Scale                                           k   Initial study   Present study
Ideal L2 Self                                   5   .88             .91
Ought-to L2 Self                                6   .74             .82
Attitudes to L2 Learning                        5   .88             .92
Intended Effort                                 5   .81             .87
Vividness of Imagery                            5   .91             .90
Visual Style                                    5   .66             .73
Auditory Style                                  5   .69             .76
Ease of Using Imagery                           5   .85             .80
Positive Changes of the Future L2 Self-Image    3   .76             .82
Negative Changes of the Future L2 Self-Image    2   .80             .62

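The Spearman-Brown coefficient for a two-item scale, as recommended by Eisinga et al. (2013), is a simple function of the correlation r between the two items: 2r / (1 + r). The sketch below illustrates the computation with hypothetical responses; the data and function names are invented for illustration and are not the study's data or code.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_brown(item_a, item_b):
    """Spearman-Brown reliability of a two-item scale: 2r / (1 + r)."""
    r = pearson_r(item_a, item_b)
    return 2 * r / (1 + r)


# Hypothetical responses of five learners to the two items of a scale.
item_1 = [1, 2, 3, 4, 5]
item_2 = [2, 1, 4, 3, 5]
reliability = spearman_brown(item_1, item_2)
print(round(reliability, 3))
```

If the reported coefficient was computed from a Pearson correlation in this way, a value of .45 would correspond to an inter-item correlation of about .29, since r = ρ / (2 − ρ).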
We also sought to add a measure of L2 achievement to our analysis by
collecting the midterm and final scores (i.e., the two major assessments in each
grade, roughly 6 months apart) for each student. The original achievement
scores had a maximum of 100, which had to be rescaled to a maximum of 10
for model identification purposes. In this context, secondary schools cover L2
learning content from a national curriculum in a standard manner and sequence
to ensure fair opportunities for success on mandated assessments. Supplemen-
tary L2 material is often not permitted in regular secondary classrooms by
the South Korean Ministry of Education simply because it introduces unequal
opportunities and unpredictability in the classroom and does not correspond
with the tightly regulated school-leaving assessments. These end-of-semester
assessments are a more objective gauge of general language proficiency than
other, more formative, assessments that occur in this instructional context be-
cause these summative assessments require learners to demonstrate that they
know particular L2 content and are able to use it. Importantly, although not
all of the schools administer the exact same test, these assessments tend to be
relatively homogeneous across schools and regions. The content and format
(i.e., multiple-choice, cloze, and transformation question types) are roughly
identical, drawing heavily on reading and listening passages that require a high
degree of comprehension and responses that demonstrate lexical, grammatical,
and discourse competence. In these final exams, females (M = 78.23, SD =
20.30) scored significantly higher than males (M = 75.23, SD = 23.60),
t(965) = 2.35, p = .019, d = 0.14, 95% CI [0.03, 0.25].
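The reported effect size can be approximated from the group means and standard deviations alone. The sketch below assumes equal group sizes when pooling the standard deviations, which is an assumption made here for illustration (the group sizes for the achievement data are not given in this passage), so the result is only an approximation of the authors' computation.

```python
from math import sqrt


def cohens_d_equal_n(mean_1, sd_1, mean_2, sd_2):
    """Cohen's d with a pooled SD that assumes equal group sizes."""
    pooled_sd = sqrt((sd_1 ** 2 + sd_2 ** 2) / 2)
    return (mean_1 - mean_2) / pooled_sd


# Final exam scores reported in the text: females vs. males.
d = cohens_d_equal_n(78.23, 20.30, 75.23, 23.60)
print(round(d, 2))
```

Under this assumption the value rounds to 0.14, in line with the reported d.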
We had the questionnaire items used in the initial study translated into the
students’ native language (Korean) by a nonaffiliated researcher familiar with
the principles of questionnaire construction and English and Korean, and then
we back-translated the items to avoid large deviations in meaning. All materials
were administered in Korean. After receiving ethics approval, we approached
school administration and teaching faculty in schools nationwide. We ultimately
obtained written institutional consent from 12 schools (44 classes in total) to
collect data in the final weeks of the 2016 school year. Students from the schools
that agreed to participate completed the survey outside of their regular class
hours. Both their English L2 teacher and a research assistant were present to
inform them about the purpose of the survey, obtain consent, and administer
the materials. Students were reminded that participation was entirely
voluntary and were assured of the confidentiality of their responses. Through-
out, all participants were treated in accordance with American Psychological
Association ethical guidelines.
Data Analysis
In the analyses, we followed our preregistration protocols. We started with the
measurement model to investigate the psychometric properties of the scales. To
determine the number of underlying factors, we submitted the data to Mokken
scaling analysis and then to confirmatory factor analysis. To further ascertain
the number of factors, we conducted additional analyses using exploratory
factor analysis, scree plot, optimal coordinates, and parallel analysis. We handled
missing data using the default function in Mplus 7 (Muthén & Muthén,
1998–2012), which estimates the model under missing data theory using all
available data. We corrected standard errors and chi-square tests to account for
nonindependence of observations because participants came from 44 classes.
Due to violation of multivariate normality, we applied a robust weighted least
squares estimator using a diagonal weight matrix. We investigated both global
fit (e.g., CFI, TLI, and root mean square error of approximation) and local
fit (normalized residuals). We computed normalized residuals by dividing the
residual of the unrestricted model by the standard error of the corresponding H1
(sample covariance) value (Kline, 2016). We have also reported factor loadings,
construct reliability, and AVE values.
For the structural model, we tested two competing models to minimize
confirmation bias. We compared a model in which intended effort was an
antecedent with a model in which intended effort was an additional outcome
variable together with achievement, and we then controlled for baseline achieve-
ment using midterm scores in the model that had exhibited better fit. Finally,
we established measurement invariance before comparing the two genders.
The Measurement Model
Multivariate Normality
To test the multivariate normality of the variables, we conducted Mardia's
test using the MVN package (Korkmaz, Goksuluk, & Zararsiz, 2014) in R
(Version 3.1.2; R Core Team, 2014). The results showed that our data were not
multivariate normal, both in skewness (218.22), χ² = 47,172.92, p < .001, and
kurtosis (3006.03), z = 216.24, p < .001. Figure 2 presents an illustration of
the lack of multivariate normality in our data. In a normal distribution scenario,
the dots would be expected to align to the straight line in the plot.
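For readers unfamiliar with Mardia's test, the following sketch shows how the two statistics are conventionally defined. It is written with NumPy rather than the MVN package used in the study, uses simulated data, and omits p-values and any small-sample corrections, so it should be read as an illustration of the textbook formulas rather than a reproduction of the authors' analysis. The function name is hypothetical.

```python
import numpy as np


def mardia_statistics(data):
    """Mardia's multivariate skewness and kurtosis with their test statistics."""
    x = np.asarray(data, dtype=float)
    n, p = x.shape
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / n                 # maximum likelihood covariance
    d = centered @ np.linalg.inv(cov) @ centered.T  # Mahalanobis cross-products
    skewness = (d ** 3).sum() / n ** 2
    kurtosis = (np.diag(d) ** 2).sum() / n
    return {
        "skewness": skewness,
        "skewness_chi2": n * skewness / 6,          # approx. chi-square distributed
        "skewness_df": p * (p + 1) * (p + 2) // 6,
        "kurtosis": kurtosis,
        # approx. standard normal under multivariate normality
        "kurtosis_z": (kurtosis - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n),
    }


# Simulated, deliberately skewed data (not the study's data).
rng = np.random.default_rng(0)
skewed = rng.exponential(size=(300, 3))
result = mardia_statistics(skewed)
print(result)
```

Applied to the reported values, these formulas are consistent with the text: with n = 1,297, a skewness of 218.22 gives n × 218.22 / 6 ≈ 47,172, and, assuming all 46 items from Table 2 entered the test, a kurtosis of 3006.03 gives z ≈ 216.2.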
Number of Factors
Following our preregistration protocols, we first submitted the 10 scales to a
Mokken scaling analysis using MSP5 (Molenaar & Sijtsma, 2000). This pro-
cedure is a nonparametric item response theory model aimed at determining
the number of unidimensional factors underlying the data (Meijer & Baneke,
2004; van der Eijk & Rose, 2015). The analysis returned only four factors
(out of the 10 scales). Table 3 presents the scales and their associated items
that were scalable (the remaining items were not scalable). The table shows
that items from the Ideal L2 Self, Vividness of Imagery, Ease of Using Im-
agery, and Negative Changes of the Future L2 Self-Image scales all loaded on
the same factor (Factor 2). This pattern was in line with our analysis above
(cf. Table 1), suggesting that, based on their parallel wording, these items were
more likely to be manifestations of one latent variable than different latent
variables. Similarly, Attitudes to L2 Learning and Intended Effort loaded onto
one factor.

Figure 2 Quantile-quantile plot showing violation of multivariate normality.

Although in our preregistration protocol we explicitly predicted that some
scales would not exhibit sufficient discriminant validity, the reduction from
10 to four factors was surprising (more than a 50% reduction). Because this
was a substantial reduction, we decided to conduct further tests that we had
not preregistered in order to ascertain the validity of the Mokken results. We
examined our data using scree plot, eigenvalue over 1.0 criterion, parallel
analysis, optimal coordinates, and acceleration factor using the SPSS R-Menu
2.0 (Courtney, 2013; see Figure 3) as well as exploratory factor analysis. Most
of these pointed to four factors only, thus raising our confidence in this number
of factors. The acceleration factor suggested one factor only, but this method
has been criticized for underestimating the number of factors (Ruscio & Roche,
2012). The eigenvalue over 1.0 criterion, which is considered very unreliable
(van der Eijk & Rose, 2015), returned seven factors, which was also fewer than
the 10 factors hypothesized in the initial study. Appendix S2 in the Supporting
Information online presents the results from exploratory factor analysis, which
showed—perhaps more clearly—that Attitudes to L2 Learning and Intended
Effort loaded on one factor, and therefore it would not have been appropriate to
treat them as two separate scale variables in our study. In our case, we selected
the Intended Effort over the Attitudes to L2 Learning scale, as the former was
the (only) outcome variable in the initial study.
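To illustrate how two of the retention criteria mentioned above work, the sketch below implements a basic version of parallel analysis alongside the eigenvalue over 1.0 criterion. It uses NumPy and simulated two-factor data rather than the SPSS R-Menu and the study's data; the function name, simulation settings, and percentile threshold are assumptions chosen for illustration.

```python
import numpy as np


def parallel_analysis(data, n_sims=200, percentile=95, seed=0):
    """Count leading eigenvalues that exceed those of random normal data."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
    simulated = np.empty((n_sims, p))
    for s in range(n_sims):
        noise = rng.standard_normal((n, p))
        simulated[s] = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))[::-1]
    threshold = np.percentile(simulated, percentile, axis=0)
    retained = 0
    for obs, thr in zip(observed, threshold):
        if obs <= thr:  # stop at the first eigenvalue not above chance level
            break
        retained += 1
    return retained, observed, threshold


# Simulated data: six items, two uncorrelated factors with three items each.
rng = np.random.default_rng(1)
factors = rng.standard_normal((500, 2))
loadings = np.zeros((2, 6))
loadings[0, :3] = 0.8
loadings[1, 3:] = 0.8
items = factors @ loadings + 0.6 * rng.standard_normal((500, 6))

n_factors, observed, threshold = parallel_analysis(items)
kaiser_count = int((observed > 1.0).sum())  # eigenvalue over 1.0 criterion
print(n_factors, kaiser_count)
```

With clean simulated data the two criteria agree. With real questionnaire data they can diverge, as they did in the analyses reported above, which is one reason for consulting several criteria.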
Table 3 Factors, homogeneities (H), and reliabilities (rho) resulting from Mokken
scaling analysis

Factor                              Item        M      Item H
Factor 1                            ATLL1       4.11   .71
(Factor H = .67, rho = .92)         ATLL2       4.28   .72
                                    ATLL3       4.52   .68
                                    ATLL4       3.69   .66
                                    ATLL5       4.14   .65
                                    Intended1   4.94   .61
                                    Intended2   4.32   .62
Factor 2                            Ideal1      4.46   .64
(Factor H = .66, rho = .96)         Ideal2      4.02   .65
                                    Ideal3      4.19   .65
                                    Ideal4      4.42   .69
                                    Ideal5      4.31   .68
                                    Vivid1      4.27   .69
                                    Vivid2      4.85   .67
                                    Vivid4      4.35   .68
                                    Vivid5      4.55   .69
                                    Ease1       4.10   .63
                                    Ease3       4.48   .68
                                    Ease4       3.89   .62
                                    Pos1        4.50   .62
Factor 3                            Ought1      3.64   .69
(Factor H = .69, rho = .81)         Ought2      3.35   .69
Factor 4                            Visual1     5.03   .69
(Factor H = .69, rho = .81)         Visual2     4.77   .69

Note. Homogeneity was set at a minimum of .60. ATLL = Attitudes to L2 Learning;
Ease = Ease of Using Imagery; Ideal = Ideal L2 Self; Intended = Intended
Effort; Ought = Ought-to L2 Self; Pos = Positive Change in the Future L2 Self-Image;
Visual = Visual Style; Vivid = Vividness of Imagery.

Confirmatory Factor Analysis
The results indicated that several scales analyzed in the initial study did not
seem to possess adequate psychometric properties to permit further analyses or
conclusions derived from them for the current study. The only scales that clearly
emerged from the analyses to this point were the Ideal L2 Self, the Ought-to L2
Self, Visual Style, and Intended Effort. This resulted in our design taking on a
more exploratory nature, as we had anticipated in our preregistration protocols.
Figure 3 Number of factors based on using scree plot, eigenvalues over 1.0 criterion,
parallel analysis, optimal coordinates, and acceleration factor.
Because our goal was to approximate the model in the initial study as closely
as possible, we conducted a confirmatory factor analysis on the four scales
emerging from our analysis. In all subsequent analyses, we further adjusted
standard errors to correct for clustering within the 44 classes in our sample
(ICC = .117, Deff = 4.33). We also had to exclude one item from the Visual
Style scale (Visual2; see Appendix S1) to improve convergent validity. This
scale also had the lowest reliability in the initial study (see Table 2). It was
likely that we had to exclude this item because it was the only item specific
to using English fluently, whereas all other items in that scale were concerned
with using English successfully or skillfully more generally. In any case, we
had no reason to believe that dropping one item had a substantial impact in a
conceptual replication.
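The design effect reported above for clustering within classes can be approximated with the standard formula Deff = 1 + (m − 1) × ICC, where m is the average cluster size. The sketch below assumes that the average class size is simply the total sample divided by the number of classes; the text does not state which formula the authors used, so this is an illustration under that assumption.

```python
def design_effect(n_total, n_clusters, icc):
    """Design effect for clustered data, using the average cluster size."""
    mean_cluster_size = n_total / n_clusters
    return 1 + (mean_cluster_size - 1) * icc


# Values reported in the text: 1,297 learners in 44 classes, ICC = .117.
deff = design_effect(1297, 44, 0.117)
effective_n = 1297 / deff
print(round(deff, 2), round(effective_n))
```

Under this assumption the formula gives a design effect of about 4.33, matching the reported value, and implies an effective sample size of roughly 300 independent observations, which is why standard errors needed to be corrected for clustering.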
Table 4 Reliability, validity, and interconstruct correlations for the scales of the
measurement model

Scale      CR     AVE    Ought   Ideal   Intended   Visual
Ought      .851   .494   .703
Ideal      .923   .706   .319    .840
Intended   .889   .615   .395    .767    .784
Visual     .781   .475   .329    .559    .664       .689

Note. Values in the diagonal are the square roots of their respective average variance
extracted (AVE). CR = construct reliability; Ought = Ought-to L2 Self; Ideal = Ideal
L2 Self; Intended = Intended Effort; Visual = Visual Style.

Table 4 shows that all variables exhibited acceptable construct reliability
(over the .70 threshold), though the AVEs of the Ought-to L2 Self and Visual
Style scales were just under the .50 threshold. Discriminant validity was satis-
fied because the square roots of AVEs (shown in the diagonal of Table 4) were
higher than their respective inter-construct correlations. Overall, this model
showed a reasonable global fit, χ²(164) = 1,132.545, p < .001, CFI = .954,
TLI = .947, though the root mean square error of approximation was borderline,
.067, 95% CI [.064, .071], p < .001. Inspection of the normalized
residuals showed that most residuals were within ±2.0, with the highest being
2.1 (between Ought3 and Intended5). All standardized factor loadings were
statistically significant, and most were above .70, with the lowest being .46 for
one Ought-to L2 Self scale item (see Table 5).
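The reliability and validity indices discussed above can be approximated from the standardized loadings using the conventional formulas: AVE is the mean squared loading, construct reliability is (Σλ)² / ((Σλ)² + Σ(1 − λ²)), and discriminant validity holds when the square root of each AVE exceeds the construct's correlations with the other constructs. The sketch below applies these formulas to the rounded loadings in Table 5 and the correlations in Table 4; because the inputs are rounded, the results match the published values only approximately. Function and variable names are illustrative.

```python
from math import sqrt

# Standardized loadings (Table 5) and interconstruct correlations (Table 4).
loadings = {
    "Ought": [.77, .76, .75, .46, .78, .65],
    "Ideal": [.83, .84, .81, .87, .85],
    "Intended": [.83, .82, .74, .78, .75],
    "Visual": [.72, .61, .80, .62],
}
correlations = {
    ("Ought", "Ideal"): .319, ("Ought", "Intended"): .395,
    ("Ideal", "Intended"): .767, ("Ought", "Visual"): .329,
    ("Ideal", "Visual"): .559, ("Intended", "Visual"): .664,
}


def ave(lams):
    """Average variance extracted: mean of the squared loadings."""
    return sum(lam ** 2 for lam in lams) / len(lams)


def construct_reliability(lams):
    """Composite reliability from standardized loadings."""
    squared_sum = sum(lams) ** 2
    error_sum = sum(1 - lam ** 2 for lam in lams)
    return squared_sum / (squared_sum + error_sum)


def discriminant_validity_holds(ave_by_scale, corrs):
    """Each correlation must be below the sqrt(AVE) of both constructs."""
    return all(
        abs(r) < min(sqrt(ave_by_scale[a]), sqrt(ave_by_scale[b]))
        for (a, b), r in corrs.items()
    )


ave_by_scale = {scale: ave(lams) for scale, lams in loadings.items()}
for scale, lams in loadings.items():
    print(f"{scale}: AVE = {ave_by_scale[scale]:.3f}, "
          f"CR = {construct_reliability(lams):.3f}")
print(discriminant_validity_holds(ave_by_scale, correlations))
```

The closest margin is between the Ideal L2 Self and Intended Effort, whose correlation (.767) is only slightly below the square root of the Intended Effort AVE (.784).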
The Structural Model
Two Potential Models
As we discussed above, most of the recent SEM literature in the L2 motiva-
tion field has relied on testing a single model hypothesized by researchers,
though this risks confirmation bias. In the vast majority of SEM studies, re-
searchers have hypothesized the ideal L2 self, the ought-to L2 self, and the
L2 learning experience to have a causal effect on intended effort (see Al-
Hoorie, 2018). As we discussed above, intended effort may also be conceptu-
alized as a potential antecedent that could facilitate engagement, motivation,
and language development. In this study, therefore, we tested two competing
models:
1. Following the model in the initial study, we hypothesized that Intended
Effort is the outcome of the Ideal L2 Self and the Ought-to L2 Self.
2. Extending the model in Oga-Baldwin and Nakata (2017), we hypothesized
that Intended Effort is an antecedent of the Ideal L2 Self and the Ought-to
L2 Self.
In both models, we treated Visual Style as a predictor of the Ideal L2 Self and the
Ought-to L2 Self following Dörnyei and Chan's (2013) argument that learners
with a visual sensory style preference are more likely to develop stronger
self-guides (see also Dörnyei, 2014). We also added L2 achievement as an outcome
variable in both models.

Table 5 Standardized and unstandardized factor loadings, standard errors, and z ratios
of scales in the measurement model

Path                              β     B      SE      z
Ideal L2 Self       Ideal1        .83          0.011   73.27
                    Ideal2        .84   1.01   0.011   75.45
                    Ideal3        .81   0.98   0.013   62.72
                    Ideal4        .87   1.05   0.009   93.33
                    Ideal5        .85   1.02   0.011   77.57
Ought-to L2 Self    Ought1        .77          0.015   51.70
                    Ought2        .76   0.98   0.017   43.15
                    Ought3        .75   0.96   0.017   44.75
                    Ought4        .46   0.59   0.027   17.32
                    Ought5        .78   1.01   0.013   59.28
                    Ought6        .65   0.84   0.021   31.21
Visual Style        Visual1       .72          0.017   42.25
                    Visual3       .61   0.85   0.027   22.46
                    Visual4       .80   1.11   0.014   57.67
                    Visual5       .62   0.86   0.021   29.04
Intended Effort     Intended1     .83          0.009   90.55
                    Intended2     .82   0.99   0.011   75.18
                    Intended3     .74   0.90   0.013   55.66
                    Intended4     .78   0.94   0.013   61.41
                    Intended5     .75   0.90   0.014   54.42

Note. All coefficients significant at the p < .001 level.

Figure 4 (see also Table 6) shows the results of these analyses. In both
models, the modification indices suggested one covariance term between two
Visual Style items, and all normalized residuals were under ±2.0. The model
hypothesizing Intended Effort as a predictor (Figure 4, Panel B) showed a better
overall fit and smaller Akaike information criterion and Bayesian information
criterion values, suggesting that it was more likely to be replicable. Controlling
for baseline achievement in the better fitting model (Table 6) had a minor effect
on most coefficients, with the important exception being a notable drop in the
path from the Ideal L2 Self to achievement from .44 to .09, while the initial
coefficient of the Ought-to L2 Self (–.10) became 0.

Figure 4 Two competing models. All coefficients significant at the p < .01 level unless
otherwise indicated. Akaike information criterion (AIC) and Bayesian information
criterion (BIC) were obtained through robust maximum likelihood estimation. CFI =
comparative fit index; TLI = Tucker-Lewis index; RMSEA = root mean square error
of approximation.

Table 6 Standardized and unstandardized structural coefficients, standard errors, and z ratios with and without controlling for baseline

                                          Without controlling for baseline      With controlling for baseline
Path                                      β     B      SE     z       p         β     B      SE     z       p
Ideal L2 Self → Final achievement         .44   1.24   0.13   8.88    <.001     .09   0.23   0.05   5.15    <.001
Ought-to L2 Self → Final achievement      .10   0.29   0.09   3.40    .001      .00   0.00   0.05   0.05    .957
Intended Effort → Ideal L2 Self           .62   0.62   0.05   12.31   <.001     .58   0.59   0.05   12.47   <.001
Intended Effort → Ought-to L2 Self        .24   0.23   0.06   3.55    <.001     .27   0.24   0.06   3.93    <.001
Visual Style → Ideal L2 Self              .19   0.28   0.08   3.49    <.001     .19   0.31   0.09   3.45    .001
Visual Style → Ought-to L2 Self           .20   0.28   0.09   3.06    .002      .23   0.33   0.09   3.64    <.001
Baseline Achievement → Ideal L2 Self                                            .10   0.04   0.01   5.14    <.001
Baseline Achievement → Ought-to L2 Self                                         .14   0.05   0.01   5.22    <.001
Baseline Achievement → Intended Effort                                          .32   0.13   0.01   11.18   <.001
Baseline Achievement → Visual Style                                             .31   0.08   0.01   8.30    <.001
Baseline Achievement → Final achievement                                        .81   0.82   0.03   28.43   <.001

Measurement Invariance
We investigated measurement invariance before conducting gender compar-
isons. We constrained both factor loadings and item thresholds to be equal
across the two genders. This model, however, failed to satisfy invariance,
χ²(111) = 294.277, p < .001. Subsequently, we freed the thresholds except
for one in each item in addition to another in the marker variable. This procedure
resulted in invariance being satisfied, χ²(88) = 104.234, p = .114. These
results indicated that weak invariance (i.e., factor loadings) was satisfied, but
strong invariance (i.e., item thresholds) was only partially satisfied.
Gender Differences
Because measurement invariance was partially satisfied across the two genders,
it became justifiable to conduct a multigroup SEM. We conducted this procedure
by constraining each path to be equal across the two genders and then examining
whether the model fit deteriorated significantly as a result of this equality
constraint. Deterioration of model fit would have indicated that the paths were
not in fact equal.
Table 7 presents the results of these additional models. The results showed
that one path was significantly different across the two genders and remained
significant after Bonferroni correction. This finding suggested a stronger re-
lationship between Intended Effort and the Ought-to L2 Self scales for males
but not for females. No other path was significantly different between the two
genders.
Summary of the Main Findings
This article has presented a preregistered conceptual replication of a recent
study by You et al. (2016). Our results do not point to any gender difference
in language motivation or achievement that is clearly attributable to the role of
vision. On the other hand, our results do point to the superiority of a model
positing that intended effort is an antecedent, rather than an outcome, of motivation.
Overall, our final SEM model bears little resemblance to that of the
initial study primarily due to the number of scales that we eventually used.
The initial study used 10 scales, but our analyses revealed only four scales
showing adequate psychometric properties. This points to the urgent need for
further psychometric research to refine scales that are frequently cited in the
L2 motivation literature. Several novel lines of research have begun to do this
(e.g., Papi, Bondarenko, Mansouri, Feng, & Jiang, 2019; Teimouri, 2017).

Table 7 Standardized coefficients, standard errors, and Wald tests of parameter con-

At the same time, we must acknowledge the possibility that the results
of the initial study might still be different from ours had You et al. followed
the same procedures as we used. This could be due to cultural differences
between the Chinese and South Korean populations, though, as we explained
above, we did not consider this a very likely scenario due to the usefulness of
and emphasis on English learning in the two neighboring countries. A more
likely explanation for a potentially genuine, rather than a methodologically
artifactual, difference between the two studies is the type of participants. There
was a difference in the gender ratio of the two samples, but it was rather minor.
A clearer difference, however, was in the age of the participants (12–14 years
old in our sample and 16–20 years in the initial study). This age variable
could have contributed to a genuine interstudy variation. However, the role
of age in language learning motivation is still not well understood, and this
is primarily due to the sampling methods that are (understandably) limited
by access and availability (see Boo et al., 2015). A more systematic research
program targeting learners with different ages, preferably employing multisite
registered replication reports (see below), should elucidate the contribution of
this variable to language learning motivation.
Vision and Gender
In our study, we were able to use only one of the three vision-related scales
from the initial study, namely, Visual Style. Therefore, we were unable to
examine the results obtained in the initial study in relation to either Vividness
of Imagery or Auditory Style. We leave it to future research, first, to develop
scales with adequate psychometric properties for these two constructs and,
second, to attempt to replicate You et al.’s findings from them. As for Visual
Style, our results show that this variable predicted the Ideal L2 Self rather
modestly (β = .19). This magnitude was equivalent for the two genders. We
did not, therefore, find support in our data for the idea that females are better
learners due to their superior vision-related capabilities. At the same time,
our findings support those found by Kim and Kim (2011), who showed that a
dominant visual preference is a weak variable in L2 learning. Other researchers
(e.g., Lamb, 2012; Moskovsky et al., 2016; Papi & Abdollahzadeh, 2012) have
advanced similar arguments about vision and self-guides in recent years.
Although our findings do not support a gender difference in relation to
visual style, our exploratory results do point toward one potential difference.
The relationship between Intended Effort and the Ought-to L2 Self scales was
markedly stronger for males. Assuming causality, this pattern implies that, for
males, the impact of the desire “to meet expectations and to avoid possible
negative outcomes” (Dörnyei, 2009, p. 29) that are “imposed” (p. 32) by
peers, parents, and authoritative figures becomes relevant only when sufficient
intended effort to learn the language exists in the first place. For females, how-
ever, this social element of the ought self-guide seems more persistent and not
necessarily connected to an initial intended effort. This interpretation seems
consistent with behavior genetics findings. According to Bouchard (2004; see
also Al-Hoorie, 2015), social attitudes are a domain where gender differences
clearly emerge—with heritabilities of .65 for males but only .45 for females,
suggesting that environmental influences may be more salient for females. This
implies that females tend to be more open to social persuasion than males.
This interpretation leads us to advance the ought gender-difference hypothe-
sis: The development of an ought-to L2 self is more likely to be facilitated
by the presence of a strong intended effort for males, but such a moderating
effect is less likely to exist for females. This hypothesis might help explain the
conflicting results (see Al-Hoorie, 2018) obtained for this construct to date.
We reiterate that this interpretation assumes that this exploratory finding will
replicate in future research and that this relationship is causal, which is yet to
be demonstrated experimentally in the language motivation field.
Intended Effort and Attitudes to L2 Learning
Another curious finding from our study was the lack of discriminant validity
between the Intended Effort and the Attitudes to L2 Learning (also called the
L2 Learning Experience) scales. Interestingly, Attitudes to L2 Learning has
frequently been described as the strongest predictor of Intended Effort (e.g.,
Dörnyei, 2019; Lamb, 2012). Our results suggest that this predictive validity
may be inflated because these two scales do not represent two clearly unidi-
mensional constructs. For instance, the items “I always look forward to English
classes” and “I would like to spend lots of time studying English” appeared in
the Attitudes to L2 Learning and the Intended Effort scales, respectively, in the
initial study. However, our results suggest that looking forward to something
and the desire to spend lots of time doing it do not psychometrically represent
two latent constructs. Indeed, it is rather mundane to find out that those who
self-report looking forward to something also self-report a desire to spend a lot
of time doing it (see also Gardner, 2010, p. 73, for a similar critique).
As an illustration of the prevalence of discriminant validity issues, Al-
Hoorie (2018), in a meta-analysis, compared the correlation between the In-
tended Effort and the L2 Learning Experience scales in studies that used a
factor-analytic procedure (to psychometrically ascertain the unidimensional-
ity of the two scales before examining their correlation) versus studies that
did not. The results showed a significant drop from .68 for studies without a
factor-analytic procedure to .41 for studies that implemented one. Apparently,
researchers who conducted an appropriate factor-analytic procedure were able
to detect and exclude problematic items (e.g., items loading on both constructs)
before computing correlation coefficients. An example cited in Al-Hoorie’s
(2018) meta-analysis demonstrated the severity of this problem: One study
used the item “Learning English is one of the most important aspects in my
life” in the L2 Learning Experience scale, and the item “It is extremely im-
portant for me to learn English” in the Intended Effort scale. Because these
two items are almost identical, it would be very hard to imagine that they un-
derlie two distinct constructs. Overall, these combined results suggest that the
L2 learning experience’s predictive validity of intended effort seems, in many
reports, exaggerated by some wording overlap in the items of these two scales.
Furthermore, in our conceptual replication, we added L2 achievement as
an outcome measure. As we explained above, we do not consider this decision
a deviation from the initial study but an extension of it. After all, with a
reduction from 10 to four scales, we were far beyond the point of reproducing
the initial study’s original model at any level of accuracy. Still, this additional
variable allowed us to engage in model comparison procedures that led to some
interesting insights. That is, we speculated that the Intended Effort variable,
commonly used as an outcome measure, might be argued to signal initial interest
in the language and in engaging in learning behaviors. Based on this, general
items like “I am prepared to expend a lot of effort in learning English” can be
said to mark the initial interest required to engage in the learning process and
its demands. In contrast, other specific (particularly increasingly challenging)
items then are more suitable for tapping into the outcome of this motivation
(Al-Hoorie, 2018).
As an illustration of this notion, consider an example from sports, one
field from which the L2 vision tradition has borrowed insights and metaphors.
Initially, it makes sense to ask prospective trainees whether they are willing to
expend a lot of effort and to spend lots of time practicing a certain sport. Without
such initial willingness, serious commitment and dedication to that sport and
the effort required to attain competence are unlikely. After motivation takes
shape, however, it makes less sense to keep asking the same generic questions
about willingness to expend effort. Instead, more specific and increasingly
challenging questions are more appropriate as a criterion gauging trainees’
level of motivation, such as commitment to practice despite cold weather, on
holidays, for longer hours, and the like. Such specificity is most likely a better
test of trainees’ motivation and more clearly distinguishes it from the motivation
of other trainees.
Going back to the language learning domain, it seems that a general in-
tended effort should similarly be considered as signaling initial interest in learn-
ing the language and putting in the effort to engage with its demands. Once
engagement actually does occur, there will be a dynamic interaction between
motivation (e.g., the ideal L2 self, ought-to L2 self, integrative motivation, in-
trinsic motivations, etc.) and task demands, leading to continuous recalibration
of that motivational construct, such as forming realistic aspirations and ability
expectations (Bandura, 1986, 1997). An appropriate test of the value of the
motivational construct in question, subsequently, should be its ability to pre-
dict more specific and challenging motivational intentions rather than generic
ones again. Examples include willingness to engage in optional work (e.g.,
additional homework tasks) versus only required work, and in agentic learning
(e.g., self-directed and autonomous activities) versus only prescribed activities.
More highly motivated learners are expected to exhibit more willingness to
engage in these challenging tasks (see Al-Hoorie, 2018, for a discussion) and,
as a result, attain higher competence. In our case, not only did our concep-
tualization of (generic) intended effort as an antecedent fit the data, but—to
our surprise—it also fit better than the more conventional model where it has
been conceptualized as an outcome variable. We repeat our caveat that these
findings are based on exploratory analyses, and so future replication research
should attempt to test them further, ideally longitudinally. We also encourage
researchers to investigate the possibility of nonrecursive paths, where causality
is reciprocal rather than unidirectional, as has always been assumed in our field.
Potential Confounds in Nonexperimental Research
When it comes to self-guides predicting achievement, our results also provide
some interesting insights. Panel B of Figure 4 shows that results from the
Ideal L2 Self and the Ought-to L2 Self scales initially predicted achievement
at β = .44 and β = –.10, respectively. However, after controlling for
baseline achievement (see Table 6), the coefficients shrank to β = .09 and β = .00,
respectively. The latter coefficients both fell within the 95% confidence inter-
vals of a recent meta-analysis (Al-Hoorie, 2018). The drop after controlling for
baseline achievement has been observed in previous research as well (e.g., Joe
et al., 2017; Yun et al., 2018), and suggests that many correlational results in the
literature could be inflated due to lack of baseline adjustment. This is especially
applicable in observational research (as opposed to experimental research) in
which researchers do not manipulate conditions, assign participants to groups
randomly, or use matched randomization. In observational research, various
confounds can creep in and seriously inflate (and occasionally suppress) the
relationship that one intends to observe (e.g., Beleche, Fairris, & Marks, 2012,
p. 709). Using the same logic, it is not implausible that the correlation between
a hypothesized motivational construct and achievement might be confounded
by similar variables. Despite the risk of statistical overcontrol (see Bandura,
1997, p. 69), ability level and previous achievement are strong candidates for
potential confounding variables, and so it is hard to rule out the possibility
that success breeds success and that the hypothesized motivational construct
is simply a byproduct of this process (e.g., Hiver et al., 2019). Correlation is
not causation. In short, although our results support a modest role for the Ideal
L2 Self scale (β = .09, or less than 1% of the variance), they show that the
Ought-to L2 Self scale results played virtually no role in predicting L2 achievement
(β = .00).
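The confounding argument above can be made concrete with a small simulation. The sketch below is not the analysis reported in this article: the data are synthetic, the helper standardized_betas and every numeric weight are assumptions chosen purely for illustration, and ordinary least squares stands in for the structural model. It shows how a predictor with only a small direct effect can appear to be a strong predictor when baseline achievement, which drives both variables, is left out.

```python
import numpy as np


def standardized_betas(outcome, predictors):
    """Return OLS coefficients after z-scoring the outcome and every predictor."""
    def z(v):
        return (v - v.mean()) / v.std()

    design = np.column_stack([z(p) for p in predictors])
    coefs, *_ = np.linalg.lstsq(design, z(outcome), rcond=None)
    return coefs


rng = np.random.default_rng(0)
n = 5000

# Hypothetical data-generating process (all weights are illustrative assumptions):
# baseline achievement feeds both the self-guide score and later achievement,
# while the self-guide itself has only a small direct effect.
baseline = rng.normal(size=n)
ideal_self = 0.6 * baseline + 0.8 * rng.normal(size=n)
achievement = 0.7 * baseline + 0.05 * ideal_self + 0.6 * rng.normal(size=n)

beta_unadjusted = standardized_betas(achievement, [ideal_self])[0]
beta_adjusted = standardized_betas(achievement, [ideal_self, baseline])[0]

print(f"beta without baseline control: {beta_unadjusted:.2f}")
print(f"beta with baseline control:    {beta_adjusted:.2f}")
```

Under these assumptions the unadjusted coefficient is several times larger than the adjusted one, mirroring the pattern described above. As a side note on the reported figures, squaring β = .09 gives .0081, which is presumably the basis for the statement that the Ideal L2 Self scale accounts for less than 1% of the variance.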
Replication, Future Research, and Open Science Practices
To our knowledge, this article is the first preregistered replication study, and one
of the few self-labeled replication attempts, in the language motivation field. We
hope that greater numbers of language motivation researchers will take up this
initiative by engaging in direct, partial, and conceptual replications of critical
results in the literature (see Marsden, Morgan-Short, Thompson, & Abugaber,
2018, for a discussion of the kinds of findings that warrant replication) and by
encouraging their graduate students to do so. It is vital for a field to keep track of
the replicability of its findings. If results do not replicate, the field can engage in
self-correction efforts to reduce the chances of a perceived replication crisis—
as has been reported in a number of disciplines including biology, genetics,
medicine, and psychology (Schooler, 2014).
We emphasize, yet again, that we do not claim that You et al.’s results were
unfounded, were merely an artifact of their methodology, or were flawed. It
is by no means unusual for replication research to obtain divergent results,
particularly when the replication is conceptual, and so this should not be
viewed as dismissing the value of the initial study. Instead, the natural im-
plication in such cases is the need for further research and more robust designs
to better understand the phenomena in question, especially because SEM in
particular relies on a set of stringent but often unmet assumptions (Good-
boy & Kline, 2017, p. 69). Many language motivation researchers may be
reluctant to do this considering the Kuhnian normal science status that the
L2 motivational self-system has reached, which can give the perception that
it is above critique (see Henry & Cliffordson, 2017). To ensure meaningful
interpretation of existing results, a systematic replication program would al-
low our field to take stock of the best empirical evidence for a cumulative
Of course, all this points to the need to replicate our own model as well.
Nevertheless, our recommendation for future research would not in fact be to
suggest that scholars conduct yet more SEM studies, at least not initially and
not in isolation. A more fruitful direction would be to first establish causality
among variables experimentally (for similar calls, see Al-Hoorie, 2018; Lamb,
2017; Yun et al., 2018). We also recommend that motivation researchers delib-
erately advance hypotheses for future testing as we have done in this article. For
one reason or another, and despite a recent surge in publications (see Boo et al.,
2015), the language motivation field has not developed a culture of hypothesis
testing. In fact, unlike various sister L2 subdisciplines where hypothesis testing
is routine, one would be hard pressed to think of a hypothesis explicitly for-
mulated in such a way that it could be tested independently in L2 motivation
Whether choosing to engage in SEM, experiments, or other types of re-
search, we strongly encourage future researchers to preregister their designs
and analysis plans to maximize transparency and to minimize the possibil-
ity of (unintentional) questionable research practices, such as reporting only
significant results or reframing the research questions and hypotheses to fit
the results already obtained, which is often referred to as “hypothesizing after
the results are known” (see Kerr, 1998). One valuable initiative in this regard
is the Registered Reports housed at Language Learning (Marsden, Morgan-
Short, Trofimovich, & Ellis, 2018), in which authors are invited to submit
the full method and analysis protocol of their proposed study—along with its
conceptual justification—before the actual data collection. Submissions that
pass an initial peer review stage are conferred an in-principle acceptance sta-
tus, guaranteeing subsequent acceptance by the journal regardless of the re-
sults, provided that researchers adhere to the original method and analysis
protocols. This approach should encourage more researchers to publish their
reports irrespective of the findings that they obtain and, it is hoped, should
minimize publication bias. A particularly strong approach, achievable within
the Registered Report publication route, is a multisite replication in which re-
searchers at different sites attempt to replicate one finding independently while
adhering to preregistered protocols (for more details, see Morgan-Short et al.,
Finally, we would also encourage researchers to make their datasets publicly
available whenever possible. The uptake of these open science initiatives has
slowly started to grow, evidenced by the increasing number of leading journals
in the field that have begun to champion them, and we look forward to these
initiatives having a net positive effect on research practices and empirical
evidence in the language motivation field.
All in all, the involvement of vision, imagery, and sensory modalities has
been argued to represent a “key aspect” (Dörnyei, 2014, p. 10) of future self-
guides, particularly the ideal L2 self. However, the present study did not offer
unambiguous support for a vision scale that is distinct from the ideal L2 self,
itself shown to be a weak predictor of academic achievement in the L2. On
balance, then, if there is one thing to which the evidence available so far may
point, it is that the notion of centrality of vision, involving tangible images and
senses, to language learning motivation deserves further substantiating.
Final revised version accepted 3 June 2019
Notes
1 In current language learning motivation literature, no functional distinction is made
between mental imagery and vision. These terms are used interchangeably and
often alternatively for stylistic reasons. Throughout this study, we adopted the term
vision in line with the initial study.
2 The field of sports psychology has adopted the term motor imagery to refer to any
mental practice that involves imagination or vision.
3 We have presented several mathematical formulas in the text. We have chosen to
present these because standard SEM software does not calculate these indices and,
therefore, researchers must rely on these formulas to calculate these indices
themselves. Lowry and Gaskin (2014) have provided an accessible introduction to
these concepts, and Gaskin’s (2016) Stats Tools Package calculates most of the
formulas presented in this article within an Excel spreadsheet.
4 After we had concluded our data collection, we realized that the initial study had
used a 6-point Likert response format. This minor difference likely had minimal
impact on our results, especially because research by Felix (2011) showed hardly
any difference whether a 3-, 5-, 7-, or even 9-point scale is used, suggesting that
“scale width does not influence important indicators such as means, standard
deviations and skewness” (p. 143).
5 Readers interested in this and similar topics may consult the recently launched
journal Advances in Methods and Practices in Psychological Science.
References
Adolphs, S., Clark, L., Dörnyei, Z., Glover, T., Henry, A., Muir, C., . . . Valstar, M.
(2018). Digital innovations in L2 motivation: Harnessing the power of the Ideal L2
Self. System, 78, 173–185.
Albalawi, F. H. E. (2018). L2 demotivation among Saudi learners of English: The role
of language learning mindsets (Unpublished doctoral dissertation). Nottingham,
UK: University of Nottingham.
Al-Hoorie, A. H. (2015). Human agency: Does the beach ball have free will? In Z.
Dörnyei, P. MacIntyre, & A. Henry (Eds.), Motivational dynamics in language
learning (pp. 55–72). Bristol, UK: Multilingual Matters.
Al-Hoorie, A. H. (2016a). Unconscious motivation. Part I: Implicit attitudes toward L2
speakers. Studies in Second Language Learning and Teaching, 6, 423–454.
Al-Hoorie, A. H. (2016b). Unconscious motivation. Part II: Implicit attitudes and L2
achievement. Studies in Second Language Learning and Teaching, 6, 619–649.
Al-Hoorie, A. H. (2018). The L2 motivational self system: A meta-analysis. Studies in
Second Language Learning and Teaching, 8, 721–754.
Al-Hoorie, A. H., & Vitta, J. P. (in press). The seven sins of L2 research: A review of
30 journals’ statistical quality and their CiteScore, SJR, SNIP, JCR impact factors.
Language Teaching Research. Advance online publication.
Al-Shehri, A. S. (2009). Motivation and vision: The relation between the ideal L2 self,
imagination and visual style. In Z. Dörnyei & E. Ushioda (Eds.), Motivation,
language identity and the L2 self (pp. 164–171). Bristol, UK: Multilingual Matters.
Arbuckle, J. L. (2013). IBM SPSS Amos 22 user’s guide. Meadville, PA: Amos
Development Corporation.
Bandura, A. (1986). Social foundations of thought and action: A social cognitive
theory. Englewood Cliffs, NJ: Prentice-Hall.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York, NY: Freeman.
Barcikowski, R. S. (1981). Statistical power with group mean as the unit of analysis.
Journal of Educational Statistics, 6, 267–285.
Beleche, T., Fairris, D., & Marks, M. (2012). Do course evaluations truly reflect
student learning? Evidence from an objectively graded post-test. Economics of
Education Review, 31, 709–719.
Bohannon, J. (2014). Replication effort provokes praise—and ‘bullying’ charges.
Science, 344, 788–789.
Bollen, K. A., & Pearl, J. (2013). Eight myths about causality and structural equation
models. In S. Morgan (Ed.), Handbook of causal analysis for social research
(pp. 301–328). New York, NY: Springer.
Boo, Z., Dörnyei, Z., & Ryan, S. (2015). L2 motivation research 2005–2014:
Understanding a publication surge and a changing landscape. System, 55, 145–157.
Bouchard, T. J., Jr. (2004). Genetic influence on human psychological traits: A survey.
Current Directions in Psychological Science, 13, 148–151.
Brandt, M., IJzerman, H., Dijksterhuis, A., Farach, F., Geller, J., Giner-Sorolla, R., . . .
van’t Veer, A. (2014). The replication recipe: What makes for a convincing
replication? Journal of Experimental Social Psychology, 50, 217–224.
Brown, T. A. (2015). Confirmatory factor analysis for applied research (2nd ed.). New
York, NY: Guilford Press.
Chan, L. (2014). Effects of an imagery training strategy on Chinese university
students’ possible second language selves and learning experiences. In K. Csizér &
M. Magid (Eds.), The impact of self-concept on language learning (pp. 357–376).
Bristol, UK: Multilingual Matters.
Csizér, K., & Kormos, J. (2009). Learning experiences, selves and motivated learning
behavior: A comparative analysis of structural models for Hungarian secondary and
university learners of English. In Z. Dörnyei & E. Ushioda (Eds.), Motivation,
language identity and the L2 self (pp. 98–119). Bristol, UK: Multilingual Matters.
Csizér, K., & Lukács, G. (2010). The comparative analysis of motivation, attitudes and
selves: The case of English and German in Hungary. System, 38, 1–13.
Courtney, M. G. R. (2013). Determining the number of factors to retain in EFA: Using
the SPSS R-menu v2.0 to make more judicious estimations. Practical Assessment,
Research & Evaluation, 18, 1–15. Retrieved from
Davidov, E., Meuleman, B., Cieciuch, J., Schmidt, P., & Billiet, J. (2014).
Measurement equivalence in cross-national research. Annual Review of Sociology,
40, 55–75.
Dörnyei, Z. (2005). The psychology of the language learner: Individual differences in
second language acquisition. London, UK: Erlbaum.
Dörnyei, Z. (2009). The L2 motivational self system. In Z. Dörnyei & E. Ushioda
(Eds.), Motivation, language identity and the L2 self (pp. 9–42). Bristol, UK:
Multilingual Matters.
Dörnyei, Z. (2014). Future self-guides and vision. In K. Csizér & M. Magid (Eds.),
The impact of self-concept on language learning (pp. 7–18). Bristol, UK:
Multilingual Matters.
Dörnyei, Z. (2019). Towards a better understanding of the L2 learning experience, the
Cinderella of the L2 motivational self system. Studies in Second Language
Learning and Teaching, 9, 21–32.
Dörnyei, Z., & Chan, L. (2013). Motivation and vision: An analysis of future L2 self
images, sensory styles, and imagery capacity across two target languages. Language
Learning, 63, 437–462.
Dörnyei, Z., Henry, A., & Muir, C. (2016). Motivational currents in language
learning: Frameworks for focused interventions. New York, NY: Routledge.
Dörnyei, Z., & Kubanyiova, M. (2014). Motivating learners, motivating teachers:
Building vision in the language classroom. Cambridge, UK: Cambridge University
Press.
Dörnyei, Z., & Ushioda, E. (2011). Teaching and researching motivation (2nd ed.).
Harlow, UK: Pearson.
Dunlosky, J., Rawson, K. A., Marsh, E. J., Nathan, M. J., & Willingham, D. T. (2013).
Improving students’ learning with effective learning techniques: Promising
directions from cognitive and educational psychology. Psychological Science in the
Public Interest, 14, 4–58.
Eisinga, R., Grotenhuis, M., & Pelzer, B. (2013). The reliability of a two-item scale:
Pearson, Cronbach, or Spearman-Brown? International Journal of Public Health,
58, 637–642.
Felix, R. (2011). The impact of scale width on responses for multi-item, self-report
measures. Journal of Targeting, Measurement and Analysis for Marketing, 19,
Fishbein, M., & Ajzen, I. (2010). Predicting and changing behavior: The reasoned
action approach. New York, NY: Psychology Press.
Fornell, C., & Larcker, D. F. (1981). Evaluating structural equation models with
unobservable variables and measurement error. Journal of Marketing Research, 18,
Gardner, R. C. (1985). Social psychology and second language learning: The role of
attitudes and motivation. London, UK: Edward Arnold.
Gardner, R. C. (2010). Motivation and second language acquisition: The
socio-educational model. New York, NY: Peter Lang.
Gaskin, J. (2016). Stats Tools Package [Computer software]. Retrieved from
Gelman, A., & Stern, H. S. (2006). The difference between “significant” and “not
significant” is not itself statistically significant. The American Statistician, 60,
Goodboy, A. K., & Kline, R. B. (2017). Statistical and practical concerns with
published communication research featuring structural equation modeling.
Communication Research Reports, 34, 68–77.
Hadfield, J., & Dörnyei, Z. (2013). Motivating learning. New York, NY: Routledge.
Hair, J. F., Jr., Black, W. C., Babin, B. J., & Anderson,