Complexity Theory: From Metaphors to Methodological Advances

Ali H. Al-Hoorie and Phil Hiver

In A.H. Al-Hoorie & F. Szabó (Eds.), Researching language learning motivation: A concise guide. Bloomsbury.
In recent years, scholarship in second language development (SLD) has
pivoted toward a transdisciplinary conceptualization that positions the
learner in context as central to the multifaceted, dynamic, and emergent
process of learning and development (The Douglas Fir Group, 2016).
This realization has increased momentum for a complexity turn (Dörnyei,
MacIntyre, & Henry, 2015), one which challenges researchers to adopt a
pragmatic transdisciplinary approach to research that is problem-oriented
in nature. This reorientation of the field has strong roots in Zoltán Dörnyei’s
thinking (e.g., 2008, 2017). Many excellent volumes explore this topical
area in some depth (de Bot, Lowie, & Verspoor, 2005; Larsen-Freeman
& Cameron, 2008). These and other sources have introduced the new
generation of SLD scholars to the promise of a conceptual reorientation
around complex dynamic systems theory (CDST) in the field of SLD (see
Larsen-Freeman, 2017, for one review). Yet, as with other disciplines’ earliest
experiences drawing on understandings from CDST, it is apparent that the
empirical work in our field has yet to catch up with the rich conceptualizations
found in theoretical discussions. Simultaneously investigating “the ongoing
multiple influences between environmental and learner factors in all their
componential complexity” (Dörnyei, 2009c, p. 229) remains an uphill
challenge both at the individual level of development and across broader
patterns or commonalities. One inhibitor to scholarship informed by
complexity theory has been applied linguists’ uncertainty regarding what
conducting actual empirical research entails. This has resulted in a lack of
consensus regarding which phenomena or questions merit examination,
how systematic investigation should be structured and conducted (e.g., with
regard to instrumentation and data collection), and how the results of this
research should be analyzed and interpreted.
In response, SLD researchers have begun to expand the toolbox of
methods available to conduct research in a dynamic vein (e.g., Dörnyei et
al., 2015; Hiver & Al-Hoorie, 2020b; Verspoor, de Bot, & Lowie, 2011).
Hiver and Al-Hoorie (2016) have also proposed a dynamic ensemble,
which details ways in which CDST constrains methodological choices
while at the same time encouraging innovation and diversification. Parallel
to these advances, however, the field has developed the habit of drawing
heavily from a short list of statistical methods. In her survey of major SLD
journals, Lazaraton (2000) reported that studies using Pearson product-moment
correlations, t-tests, and ANOVAs hold the lion’s share, together
approaching 90 percent. Over a decade later, this situation has
changed little (Plonsky, 2013), which raises the obvious question of whether
these statistical methods are optimal.
If some of the methods that are in widespread use across the field are ill-
suited to studying the complex and dynamic realities of SLD and situating
these phenomena firmly in context, it would seem that an expansion of
the available methods is needed. In this chapter, we provide an overview
of some limitations of these methods, as well as alternative procedures
for overcoming these limitations. We focus on common quantitative
procedures, including correlation, multiple regression, t-test, and ANOVA.
We conclude by highlighting reliability issues likely to arise from the type of
methodological discussion we present here.
SLD research drawing from CDST is no stranger to these concerns (de
Bot & Larsen-Freeman, 2011), in that there is a need to broaden the range
of quantitative tools available to CDST research. Some work has already
been done, especially at the descriptive level (e.g., Verspoor et al., 2011).
In the words of David Byrne (2009), however, “the central project of any
science is the elucidation of causes that extend beyond the unique specific
instance” (p. 1). In this chapter, we adopt an inferential approach, which is
equally important.
There are at least two technical limitations affecting the substantive
interpretation of (zero-order) correlations: one related to shared variance
and the other to measurement error. These two points are both crucial to a
more sophisticated understanding of complex phenomena.
Shared Variance
A correlation between two variables can be thought of as “raw,” in that it
does not distinguish between shared and unique variance. Unique variance
is the proportion of variance that an independent variable uniquely
explains in a dependent variable, that is, after statistically partialing out
the contribution of the other independent variables in the model. On the
other hand, shared variance includes redundantly explained variance. When
investigating a theory, it is important to determine unique variance because
including shared variance may spuriously inflate the coefficients obtained,
and this may have substantive consequences. “Indeed, redundancy among
explanatory variables is the plague of our efforts to understand the causal
structure that underlies observations in the behavioral and social sciences”
(J. Cohen, Cohen, West, & Aiken, 2003, p. 76).
Using multiple regression, instead of correlation, can take shared
variance into account. In fact, regression is the origin of correlation, not the
other way around as some might assume, a misconception that probably
arises because correlation—for pedagogical reasons—is typically taught
first in introductory statistics courses. Correlation can be described as the
relationship that does not regress (see Campbell & Kenny, 1999). Indeed, even
the r that is used as a symbol for correlation actually stands for regression
(Campbell & Kenny, 1999; Miles & Banyard, 2007). The story does not
end with correlation. Both the t-test and ANOVA are considered special cases
of regression (J. Cohen, 1968). That is, because there are only two groups in
a t-test (or more than two in the case of ANOVA), the statistical formulae
can be considerably simplified, but fundamentally they are still regression models.
As an illustration of the advantage of using multiple regression over
correlation to account for shared variance, we simulated three variables
using R (R Core Team, 2014). The correlation matrix of the variables is in
Table 16.1. Imagine that the two independent variables (IV1 and IV2) are
motivation and attention in class, respectively, while the dependent variable
(DV) is nal achievement. As shown in Table 16.1, each variable predicts
achievement reasonably well according to conventional standards. Notice,
however, that the two IVs are also correlated with each other at .27 (which
is not uncommon in the social sciences). This suggests that individuals with
higher motivation also tend to pay more attention in class. This overlap (see
the dashed area in Figure 16.1A) is not adjusted for in the correlation of
each IV with the DV in Table 16.1 (hence, we call them raw). As a rule of
thumb, the variance of the DV explained by the two IVs would be expected
to decrease by .272 (or .14 in total, depending on the internal structure of the
variables). The .46 and .32 coefcients are valid only when the correlation
between the two IVs is zero, as in Figure 16.1B.
In order to nd out the exact magnitude of this expected decrease, we
ran multiple regression on this simulated dataset. Figure 16.2 compares
the results with those from Table 16.1, which assumes that the two IVs
are correlated at zero. When the correlation between the two IVs is
.27, the coefcients are the same as those in Table 16.1 (.46 and .32).
However, when this correlation is zero, the coefcients shrink. IV2 shows
a marked drop, which could have considerable substantive consequences.
These coefcients are called rst-order correlations because one variable
is partialed out in each case, while those in Table 16.1 are zero-order
correlations. Thus, because of the insensitivity of correlation to shared
variance, researchers are at risk of obtaining inated correlation coefcients
and losing precision.
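For the two-predictor case, readers can verify this shrinkage directly: with standardized variables, the partial regression coefficients follow from the correlation matrix alone. A sketch in Python using the values from Table 16.1 (the simulated estimates in Figure 16.2 may differ slightly from these closed-form values):

```python
# Standardized partial regression coefficients for two predictors,
# computed directly from the zero-order correlations in Table 16.1.
r_y1, r_y2, r_12 = 0.46, 0.32, 0.27   # DV-IV1, DV-IV2, and IV1-IV2

denominator = 1 - r_12 ** 2
beta_1 = (r_y1 - r_y2 * r_12) / denominator
beta_2 = (r_y2 - r_y1 * r_12) / denominator

print(f"{beta_1:.2f}")  # 0.40 -- a modest drop from the raw .46
print(f"{beta_2:.2f}")  # 0.21 -- a marked drop from the raw .32
```

The sharper drop in the second coefficient illustrates why redundancy among predictors can have substantive consequences for interpretation.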
Although regression is most commonly used to model linear relationships,
nonlinear regression offers more flexibility in many situations, especially for
CDST researchers. Statistician Simon Jackman makes this point clearly:
Table 16.1 Simulated correlations among three variables (N = 300)

        IV1    IV2    DV
IV1     –
IV2     .27    –
DV      .46    .32    –
FIGURE 16.1 Overlap between two independent variables (A) versus no overlap (B).
if there is one statistical model that we expect Ph.D. students to understand,
it is the regression model … No class on the linear regression model is
complete without showing students how logarithmic transformations,
polynomials, taking reciprocals or square roots, or interactions can be
used to estimate non-linear relationships between y and predictors of
interest. (Jackman, 2009, pp. 100–2)
With application to modeling dynamic change and development, Seber
and Wild (1989) explain that the decision to select a linear or nonlinear
model is based on theoretical considerations or observation of nonlinear
behavior, such as visual inspection or formal tests. Especially for extended
timescales, nonlinear modeling tends to be more appropriate (e.g., Miles
& Shevlin, 2001, p. 136). Rodgers, Rowe, and Buster (1998) also argue
that linear models may be appropriate in the exploratory stages of research
when little is known about the specic mechanism involved. This is not a
cure-all data analysis technique, however, and as the authors point out: “We
emphasize that nonlinear dynamic models should not generally substitute
for traditional linear analysis. They have different goals, fit different types of
data, and result in different interpretations” (Rodgers et al., 1998, p. 1097).
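A toy illustration of the transformation idea Jackman describes, using invented data that follow a logarithmic growth curve (y = 2·ln(x)): a straight line fit on the raw predictor leaves systematic misfit, while the same linear machinery applied to ln(x) fits exactly.

```python
import math

# Invented "growth" data that level off over time: y = 2 * ln(x).
x = list(range(1, 21))
y = [2 * math.log(v) for v in x]

def r_squared(xs, ys):
    """R-squared of a simple least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy ** 2 / (sxx * syy)

r2_raw = r_squared(x, y)                          # linear in raw x
r2_log = r_squared([math.log(v) for v in x], y)   # linear in ln(x)

print(round(r2_log, 3))   # 1.0 -- the transformed model fits exactly
print(r2_raw < r2_log)    # True
```

The point is not that ln(x) is the right transformation for SLD data, but that "linear" regression machinery can capture nonlinear trends once theory or visual inspection suggests an appropriate functional form.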
Measurement Error
The second limitation of correlation is that it does not account for
measurement error. This limitation also applies to regression analysis. Both
of these procedures assume that the variables in the model were measured
perfectly. This rarely happens, except perhaps in clear-cut variables such
as gender. In most cases, isomorphic parallels between a reality and
a measurement of that reality are limited, which introduces a level of
unreliability into the estimation.

FIGURE 16.2 Comparison of the regression coefficients when the independent variables are correlated at .00 versus at .27.

For example, if the reliability of a scale
is .70, which is commonly thought to be satisfactory (though see below),
there is still .30 unreliability (or .51 error variance) that goes unaccounted
for. In conventional practice, whether the reliability is .80 or .60 does not
matter in subsequent analyses; reliability is used only to determine whether
the measure is “good enough” to proceed with the analysis.
However, taking this unreliability into account would lead to more
precision and more statistical power. This can be performed through latent
variable modeling, which requires specialized software such as Amos
and Mplus, but which is nonetheless gaining traction rapidly in various
disciplines. In order to illustrate the effect of reliability on the results, we ran
the simulated data generated above in Amos three times, each time setting
the reliability at a different level: .95, .80, or .65. This makes it possible to
examine how the results are influenced by the different levels of reliability.
Reliabilities were set by fixing the error variance of the latent variable using
the formula SD²(scale) × (1 − ρ) (Brown, 2015).
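This formula can be sketched directly; assuming a standardized scale (SD = 1) for simplicity, the error variance that must be fixed grows as reliability falls:

```python
# Error variance to fix for a latent variable at a given reliability,
# following the formula SD^2(scale) * (1 - rho) (Brown, 2015).
def error_variance(sd_scale, rho):
    return sd_scale ** 2 * (1 - rho)

sd = 1.0  # a standardized scale, assumed here for illustration
for rho in (0.95, 0.80, 0.65):
    print(f"reliability {rho:.2f}: fixed error variance {error_variance(sd, rho):.2f}")
```

In practice the scale SD is estimated from the sample, so the fixed error variance scales accordingly.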
FIGURE 16.3 Comparison of standardized structural coefficients at three levels of reliability: .95/.80/.65. Squares and ovals denote observed and unobserved variables, respectively. Note that this convention was not used in Figure 16.2 in order to avoid suggesting that latent variable modeling was performed there (see J. Cohen et al., 2003, p. 66, for a similar approach). DV = dependent variable, IV = independent variable, LV = latent variable.

The results are shown in Figure 16.3. When reliability is very high (.95),
the results are very similar to those from multiple regression (see Figure
16.2, correlation at .27), again demonstrating that multiple regression
assumes that the variables are measured with perfect reliability. However,
as the reliability decreases, the two sets of results diverge. The coefficient
increases in the case of LV1, but decreases in the case of LV2. Bollen (1989)
has demonstrated that in the case of correlation (or simple regression),
measurement error attenuates (or underestimates) the relationship, but in
the case of multiple regression the bias may be upward or downward (as in
Figure 16.3). With a reliability of .65, the coefficient for LV2 drops to .19
(recall that Table 16.1 reports a coefficient of .32!). A drop of this magnitude
would most likely have significant substantive consequences.
This example illustrates that reliability is more than merely a signal
telling the researcher whether the measure is good enough to proceed with
further analyses. Incorporating reliability can increase the precision of the
estimate, which sometimes leads to a dramatic change in the results. This
strategy works best when the measures are already reasonably reliable, and
so researchers should still aim to obtain measures with good reliability and
then adjust for the remaining unreliability. On the other hand, since they
do not account for reliability, correlation and regression risk obtaining
inaccurate results. In very early phases of a line of research, correlation
may be informative. As knowledge advances, greater use of latent variable
modeling would be more appropriate.
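The attenuation Bollen describes for simple correlation can also be expressed in closed form. Spearman's classic correction estimates the true-score correlation from the observed correlation and the reliabilities of the two measures; latent variable modeling generalizes this adjustment. A sketch, with illustrative values:

```python
import math

# Spearman's correction for attenuation: the estimated true-score
# correlation, given an observed correlation and the two reliabilities.
def disattenuate(r_observed, rel_x, rel_y):
    return r_observed / math.sqrt(rel_x * rel_y)

# With two scales each at a reliability of .70, an observed correlation
# of .32 implies a substantially stronger true-score relationship.
print(f"{disattenuate(0.32, 0.70, 0.70):.2f}")  # 0.46
```

The correction makes the cost of unreliability concrete: the weaker the measures, the more the observed correlation understates the underlying relationship.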
ANOVA and t-test
The same problem described above applies to ANOVA and t-test, in that
these procedures also assume that the variables are measured without error.
Again, measurement error can be accounted for through latent variable
modeling. The latent equivalent of t-test and ANOVA is multi-group
conrmatory factor analysis (MG-CFA, e.g. Brown, 2015). In this case,
instead of using the observed means for comparing groups, the latent means
are used after adjusting for measurement error. In the case of multi-item
scales, using the observed mean is justified only when each item contributes
equally to the latent mean, which again requires perfect reliability.
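The point about equal item contributions can be made concrete. In the sketch below (loadings and responses are invented), two respondents are ordered one way by the unit-weighted observed mean and the opposite way by a composite that weights items by their loadings:

```python
# When items load unequally on the latent variable, a unit-weighted
# mean and a loading-weighted composite can rank respondents
# differently. All values here are invented for illustration.
loadings = [0.9, 0.5, 0.2]   # unequal item quality
person_a = [3, 5, 5]         # strong on the weakest items
person_b = [5, 4, 3]         # strong on the best item

unit_a = sum(person_a) / len(person_a)
unit_b = sum(person_b) / len(person_b)
weighted_a = sum(l, )if False else sum(l * x for l, x in zip(loadings, person_a))
weighted_b = sum(l * x for l, x in zip(loadings, person_b))

print(unit_a > unit_b)          # True: the raw mean favors person A
print(weighted_a < weighted_b)  # True: the weighted score favors person B
```

MG-CFA compares latent means estimated from such loadings, so group differences are not distorted by items that carry little information about the construct.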
A second problem with t-test and ANOVA is that, unlike MG-CFA, they
do not allow testing for measurement invariance, that is, whether the two
(or more) groups interpret the items in a conceptually similar manner (Hiver
& Al-Hoorie, 2020a). For example, when comparing two groups, a mean
difference might be because one group actually has a higher latent score,
but it might also be because the two groups simply understand the items
differently. That the groups understand the items in a similar way is “a
logical prerequisite” (Vandenberg & Lance, 2000, p. 9) to any meaningful
interpretation of this difference. Items could be interpreted differently in
cross-cultural and cross-age comparisons. When measurement invariance
does not hold, it means that the measure needs refinement (see, e.g., Davidov,
Meuleman, Cieciuch, Schmidt, & Billiet, 2014). The same is true in cross-
time comparisons, as understanding of abstract notions might change over
time. The measurement invariance assumption can be investigated both in
classical test theory and in modern test theory (for accessible applications,
see Davidov, 2008; Engelhard, 2013; Sass, 2011; Steinmetz, Schmidt, Tina-
Booh, Wieczorek, & Schwartz, 2009).
Last but not least, the increasing interest in CDST may encourage innovations
in instruments of measurement, which leads to a greater need to examine the
validity of scores derived from those new instruments. There is no reason to
assume that validity issues are irrelevant to CDST-inspired instruments, or
that issues of complexity and dynamic change nullify conventional notions
of validity (Hiver & Al-Hoorie, 2020b). Indeed, “the ability to use scores
for an intended research purpose is only justiable if sufcient evidence
exists for the valid interpretation of the test scores” (Purpura, Brown, &
Schoonen, 2015, p. 41, original emphasis).
In the discussion above, we mentioned that many researchers use reliability
to determine whether their measures are good enough to proceed with the
analysis. One widespread rule of thumb is to recommend a reliability of .70.
The origin of this rule of thumb is that .70 accounts for 50 percent of the
variance (or, more exactly, .70² = .49). Clearly, accounting for only about 50
percent of the variance is not a remarkable feat. In fact, Lance, Butts, and
Michels (2006) describe this cutoff recommendation as an “urban legend,”
a myth that has persisted over the decades. According to psychometricians
Nunnally and Bernstein (1994), a reliability of .70 is considered
“only modest” (p. 265). Instead, acceptable reliability “should not be below
.80 for widely used scales” (Carmines & Zeller, 1979, p. 51).
Another important consideration is that even a reliability as high as .80
is suitable only for group-level analysis. However, CDST researchers are
typically interested in individual-level applications, for which a reliability of
.80 would introduce a high margin of error. As Nunnally and Bernstein
explain,
We have noted that the standard error of measurement is almost one-third
as large as the overall standard deviation of test scores even when the
reliability is .90. If important decisions are made with respect to specific
test scores, a reliability of .90 is the bare minimum, and a reliability of .95
should be considered the desirable standard.
(Nunnally & Bernstein, 1994, p. 265)
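The “almost one-third” figure follows directly from the standard error of measurement, SEM = SD × √(1 − reliability). A quick sketch (with SD set to 1 for illustration):

```python
import math

# Standard error of measurement: SEM = SD * sqrt(1 - reliability).
def sem(sd, reliability):
    return sd * math.sqrt(1 - reliability)

# Even at a reliability of .90, the SEM is nearly one-third of the
# standard deviation of the test scores.
for rel in (0.95, 0.90, 0.80, 0.70):
    print(f"reliability {rel:.2f}: SEM = {sem(1.0, rel):.2f} SD")
```

At .70, the SEM exceeds half a standard deviation, which makes precise claims about individual learners difficult to sustain.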
Another advantage of higher levels of reliability is the ability to split
samples into subgroups. It is not uncommon that researchers need to create
subgroups of their participants. The reliability of the instrument determines
how many subgroups can be created. Table 16.2 presents a breakdown of the
relationship between reliability level and distinct subgroups (Fisher, 1992).
Based on the level of reliability, therefore, the researcher can make an
informed and more accurate decision about how many subgroups (or strata)
can be meaningfully created. This, in turn, highlights the need for more
attention and effort to be focused on instrument construction and validation
in our eld. Apparently, the myth that .70 reliability (or just below it) is
satisfactory has led to a sense of complacency and lack of interest in rening
instruments in our eld.
Throughout this chapter, our message has been that quantitative methodology
should play a vital role in CDST research. It is high time we put to rest the
persistent myth that quantitative data elicitation and analyses are poorly
suited to CDST research: This is categorically not true, although there
are quantitative designs that are much more suited to investigating dynamic
change and interconnectedness than others (e.g., Molenaar, Lerner, &
Newell, 2014; Valsiner, Molenaar, Lyra, & Chaudhary, 2009). Furthermore,
it would be misleading to claim that qualitative designs are inherently more
compatible with CDST: They are not. From the perspective of CDST’s
philosophy of science, advocating a qualitative-only approach is neither
defensible nor pragmatic for the range of phenomena that necessitate
investigation (Hiver & Al-Hoorie, 2016, 2020b). Qualitative research
designs do not by themselves guarantee a more complex and dynamic
perspective for research, particularly if the research design is not inherently
connected to or informed by the conceptual framework of complexity
(Dörnyei et al., 2015). As others have noted, the selection of methods for
Table 16.2 Reliability levels and number of distinct subgroups

Reliability    Subgroups
.50            1
.70            2
.80            3
.90            4
.94            5
.96            7
.97            8
.98            9
complexity-based inquiry in SLD does not suggest an either/or choice, and
we do not, of course, wish to understate the value of qualitative methods
that allow nely grained observations, having both reviewed and used some
of these methods ourselves rsthand.
Importantly, the potential of quantitative analyses for CDST research
extends past the mundane comparisons of groups and measurements of
linear relationships into the more compelling areas of identifying underlying
structure, accounting for variation at different levels, discerning temporal
processes and events, quantifying trends, predicting group membership,
applying spatial analysis, and studying networked phenomena nested in
contexts. We have written on these and other methods in book-length form
elsewhere (Hiver & Al-Hoorie, 2020b) and encourage interested readers
to explore these to the extent they feel comfortable with such innovative
methods. Provided that research applying a dynamic perspective ensures
that, from the earliest stages of design, there is no disconnect between the
conceptual framework of CDST and the statistical analyses used, there is
every reason to believe that quantitative methods will serve to advance and
accelerate our understanding of SLD.