All content in this area was uploaded by Philip Hiver on Sep 21, 2021

CHAPTER 16

Complexity Theory: From Metaphors to Methodological Advances

Ali H. Al-Hoorie and Phil Hiver

In A.H. Al-Hoorie & F. Szabó (Eds.), Researching language learning motivation: A concise guide. Bloomsbury.

Introduction

In recent years, scholarship in second language development (SLD) has

pivoted toward a transdisciplinary conceptualization that positions the

learner in context as central to the multifaceted, dynamic, and emergent

process of learning and development (The Douglas Fir Group, 2016).

This realization has increased momentum for a complexity turn (Dörnyei,

MacIntyre, & Henry, 2015), one which challenges researchers to adopt a

pragmatic transdisciplinary approach to research that is problem-oriented

in nature. This reorientation of the field has strong roots in Zoltán Dörnyei’s

thinking (e.g., 2008, 2017). Many excellent volumes explore this topical

area in some depth (de Bot, Lowie, & Verspoor, 2005; Larsen-Freeman

& Cameron, 2008). These and other sources have introduced the new

generation of SLD scholars to the promise of a conceptual reorientation

around complex dynamic systems theory (CDST) in the field of SLD (see

Larsen-Freeman, 2017, for one review). Yet, as with other disciplines’ earliest

experiences drawing on understandings from CDST, it is apparent that the

empirical work in our field has yet to catch up with the rich conceptualizations

found in theoretical discussions. Simultaneously investigating “the ongoing

multiple influences between environmental and learner factors in all their

componential complexity” (Dörnyei, 2009c, p. 229) remains an uphill

challenge both at the individual level of development and across broader

patterns or commonalities. One inhibitor to scholarship informed by

complexity theory has been applied linguists’ uncertainty regarding what

conducting actual empirical research entails. This has resulted in a lack of

consensus regarding which phenomena or questions merit examination,

how systematic investigation should be structured and conducted (e.g., with

regard to instrumentation and data collection), and how the results of this

research should be analyzed and interpreted.

In response, SLD researchers have begun to expand the toolbox of

methods available to conduct research in a dynamic vein (e.g., Dörnyei et

al., 2015; Hiver & Al-Hoorie, 2020b; Verspoor, de Bot, & Lowie, 2011).

Hiver and Al-Hoorie (2016) have also proposed a dynamic ensemble,

which details ways in which CDST constrains methodological choices

while at the same time encouraging innovation and diversification. Parallel

to these advances, however, the field has developed the habit of drawing

heavily from a short list of statistical methods. In her survey of major SLD

journals, Lazaraton (2000) reported that studies using

Pearson product-moment correlations and t-tests (and ANOVAs) hold the

lion’s share, approaching 90 percent. Over a decade later, this situation has

changed little (Plonsky, 2013), which raises the obvious question of whether

these statistical methods are optimal.

If some of the methods that are in widespread use across the field are ill-

suited to studying the complex and dynamic realities of SLD and situating

these phenomena firmly in context, it would seem that an expansion of

the available methods is needed. In this chapter, we provide an overview

of some limitations of these methods, as well as alternative procedures

for overcoming these limitations. We focus on common quantitative

procedures, including correlation, multiple regression, t-test, and ANOVA.

We conclude by highlighting reliability issues likely to arise from the type of

methodological discussion we present here.

SLD research drawing from CDST is no stranger to these concerns (de

Bot & Larsen-Freeman, 2011), in that there is a need to broaden the range

of quantitative tools available to CDST research. Some work has already

been done, especially at the descriptive level (e.g., Verspoor et al., 2011).

In the words of David Byrne (2009), however, “the central project of any

science is the elucidation of causes that extend beyond the unique specific

instance” (p. 1). In this chapter, we adopt an inferential approach, which is

equally important.

Correlation

There are at least two technical limitations affecting the substantive

interpretation of (zero-order) correlations: one related to shared variance

and the other to measurement error. These two points are both crucial to a

more sophisticated understanding of complex phenomena.

Shared Variance

A correlation between two variables can be thought of as “raw,” in that it

does not distinguish between shared and unique variance. Unique variance

is the proportion of variance that an independent variable uniquely

explains in a dependent variable, that is, after statistically partialing out

the contribution of the other independent variables in the model. On the

other hand, shared variance includes redundantly explained variance. When

investigating a theory, it is important to determine unique variance because

including shared variance may spuriously inflate coefficients obtained,

and this may have substantive consequences. “Indeed, redundancy among

explanatory variables is the plague of our efforts to understand the causal

structure that underlies observations in the behavioral and social sciences”

(J. Cohen, Cohen, West, & Aiken, 2003, p. 76).

Using multiple regression, instead of correlation, can take shared

variance into account. In fact, regression is the origin of correlation, not the

other way around as some might assume, a misconception that probably

arises because correlation—for pedagogical reasons—is typically taught

first in introductory statistics courses. Correlation can be described as the

relationship that does not regress (see Campbell & Kenny, 1999). In fact, even

the r that is used as a symbol for correlation actually stands for regression

(Campbell & Kenny, 1999; Miles & Banyard, 2007). The story does not

end with correlation. Both t-test and ANOVA are considered special cases

of regression (J. Cohen, 1968). That is, because there are only two groups in

a t-test (or more than two in the case of ANOVA), the statistical formulae

can be considerably simplified, but fundamentally they are still a regression

formula.
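The claim that a t-test is a special case of regression can be checked numerically. The sketch below is a minimal illustration in Python (standard library only), not the authors' own analysis. The two groups of scores are made-up numbers, and the function names `pooled_t` and `dummy_regression_t` are hypothetical helpers introduced here. It compares the classic pooled-variance t statistic with the t statistic of the slope obtained by regressing the scores on a 0/1 group dummy.

```python
import math
from statistics import mean


def pooled_t(group_a, group_b):
    """Classic two-sample t statistic with pooled variance."""
    na, nb = len(group_a), len(group_b)
    ma, mb = mean(group_a), mean(group_b)
    ss_a = sum((x - ma) ** 2 for x in group_a)
    ss_b = sum((x - mb) ** 2 for x in group_b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mb - ma) / math.sqrt(pooled_var * (1 / na + 1 / nb))


def dummy_regression_t(group_a, group_b):
    """t statistic of the slope when scores are regressed on a 0/1 group dummy."""
    x = [0.0] * len(group_a) + [1.0] * len(group_b)
    y = list(group_a) + list(group_b)
    n = len(y)
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residual_ss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope / slope_se


# Hypothetical scores for two groups (made-up numbers, illustration only).
control = [52.0, 48.0, 55.0, 60.0, 47.0, 51.0]
treatment = [58.0, 63.0, 54.0, 66.0, 61.0, 59.0]

t_classic = pooled_t(control, treatment)
t_regression = dummy_regression_t(control, treatment)
print(round(t_classic, 4), round(t_regression, 4))  # the two values coincide
```

The slope of the dummy regression is simply the difference between the two group means, and its standard error reduces to the pooled standard error, which is why the two statistics agree.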

As an illustration of the advantage of using multiple regression over

correlation to account for shared variance, we simulated three variables

using R (R Core Team, 2014). The correlation matrix of the variables is in

Table 16.1. Imagine that the two independent variables (IV1 and IV2) are

motivation and attention in class, respectively, while the dependent variable

(DV) is final achievement. As shown in Table 16.1, each variable predicts

achievement reasonably well according to conventional standards. Notice,

however, that the two IVs are also correlated with each other at .27 (which

is not uncommon in the social sciences). This suggests that individuals with

higher motivation also tend to pay more attention in class. This overlap (see

the dashed area in Figure 16.1A) is not adjusted for in the correlation of

each IV with the DV in Table 16.1 (hence, we call them raw). As a rule of

thumb, the variance of the DV explained by the two IVs would be expected

to decrease by $.27^2$ (or .14 in total, depending on the internal structure of the

variables). The .46 and .32 coefficients are valid only when the correlation

between the two IVs is zero, as in Figure 16.1B.

In order to find out the exact magnitude of this expected decrease, we

ran multiple regression on this simulated dataset. Figure 16.2 compares

the results with those from Table 16.1, which assumes that the two IVs

are correlated at zero. When the correlation between the two IVs is

zero, the coefficients are the same as those in Table 16.1 (.46 and .32).

However, when this correlation is .27, the coefficients shrink. IV2 shows

a marked drop, which could have considerable substantive consequences.

These coefficients are called first-order correlations because one variable

is partialed out in each case, while those in Table 16.1 are zero-order

correlations. Thus, because of the insensitivity of correlation to shared

variance, researchers are at risk of obtaining inflated correlation coefficients

and losing precision.
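The shrinkage described above can be approximated directly from the correlations in Table 16.1. With two predictors, the standardized regression coefficients follow from the three zero-order correlations. The sketch below is a minimal Python illustration of that textbook formula, not the authors' R analysis. The function name `standardized_betas` is a hypothetical helper, and the resulting values need not match Figure 16.2 to the last decimal because the simulated raw data are not available here.

```python
def standardized_betas(r_1y, r_2y, r_12):
    """Standardized coefficients for two predictors from zero-order correlations.

    r_1y, r_2y: correlations of IV1 and IV2 with the DV.
    r_12: correlation between IV1 and IV2.
    """
    denom = 1.0 - r_12 ** 2
    beta_1 = (r_1y - r_2y * r_12) / denom
    beta_2 = (r_2y - r_1y * r_12) / denom
    return beta_1, beta_2


# Correlations taken from Table 16.1; the zero-overlap case sets r_12 to 0.
uncorrelated = standardized_betas(0.46, 0.32, 0.0)
correlated = standardized_betas(0.46, 0.32, 0.27)

print(uncorrelated)                         # (0.46, 0.32)
print([round(b, 2) for b in correlated])    # [0.4, 0.21]
```

When the predictors do not overlap, the coefficients equal the raw correlations. Once the .27 overlap is partialed out, both coefficients shrink, and the coefficient for IV2 loses roughly a third of its size.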

Table 16.1 Simulated correlations among three variables (N = 300)

        IV1     IV2
IV1     –
IV2     .27     –
DV      .46     .32

FIGURE 16.1 Overlap between two independent variables (A) versus no overlap (B).

Although regression is most commonly used to model linear relationships,

nonlinear regression offers more flexibility in many situations, especially for

CDST researchers. Statistician Simon Jackman makes this point clearly:

if there is one statistical model that we expect Ph.D. students to understand,

it is the regression model … No class on the linear regression model is

complete without showing students how logarithmic transformations,

polynomials, taking reciprocals or square roots, or interactions can be

used to estimate non-linear relationships between y and predictors of

interest.

(Jackman, 2009, pp. 100–2)

With application to modeling dynamic change and development, Seber

and Wild (1989) explain that the decision to select a linear or nonlinear

model is based on theoretical considerations or observation of nonlinear

behavior, such as visual inspection or formal tests. Especially for extended

timescales, nonlinear modeling tends to be more appropriate (e.g., Miles

& Shevlin, 2001, p. 136). Rodgers, Rowe, and Buster (1998) also argue

that linear models may be appropriate in the exploratory stages of research

when little is known about the specific mechanism involved. Nonlinear modeling is not a

cure-all data analysis technique, however, and as the authors point out: “We

emphasize that nonlinear dynamic models should not generally substitute

for traditional linear analysis. They have different goals, fit different types of

data, and result in different interpretations” (Rodgers et al., 1998, p. 1097).
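One of the devices Jackman mentions, the logarithmic transformation, can be sketched in a few lines. The example below is a hedged Python illustration with made-up, noise-free data in which a hypothetical outcome (`scores`) grows with the logarithm of a hypothetical predictor (`hours`). Both variable names and the helper `fit_line` are assumptions introduced here. The same ordinary least squares routine is fitted to the raw predictor and to its logarithm.

```python
import math
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns intercept, slope, R^2."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot


# Hypothetical data: gains flatten out as the predictor grows (log-shaped curve).
hours = [1, 2, 4, 8, 16, 32, 64]
scores = [10.0 + 5.0 * math.log(h) for h in hours]

# Straight line on the raw predictor versus a line on the log-transformed predictor.
_, _, r2_raw = fit_line(hours, scores)
log_hours = [math.log(h) for h in hours]
intercept_log, slope_log, r2_log = fit_line(log_hours, scores)

print(round(r2_raw, 3), round(r2_log, 3))
```

The model remains linear in its parameters, so the usual regression machinery applies, yet the fitted relationship between the outcome and the original predictor is curved. With real, noisy data the gain in fit would of course be less clear-cut.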

FIGURE 16.2 Comparison of the regression coefficients when the independent variables are correlated at .00 versus at .27.

Measurement Error

The second limitation of correlation is that it does not account for

measurement error. This limitation also applies to regression analysis. Both

of these procedures assume that the variables in the model were measured

perfectly. This rarely happens, except perhaps in clear-cut variables such

as gender. In most cases, isomorphic parallels between a reality and

a measurement of that reality are limited, which introduces a level of

unreliability into the estimation. For example, if the reliability of a scale

is .70, which is commonly thought to be satisfactory (though see below),

there is still .30 unreliability (or .51 error variance, that is, $1 - .70^2 = .51$) that goes unaccounted

for. In conventional practice, whether the reliability is .80 or .60 does not

matter in subsequent analyses; reliability is used only to determine whether

the measure is “good enough” to proceed with the analysis.

However, taking this unreliability into account would lead to more

precision and more statistical power. This can be performed through latent

variable modeling, which requires specialized software such as Amos

and Mplus, but which is nonetheless gaining traction rapidly in various

disciplines. In order to illustrate the effect of reliability on the results, we ran

the simulated data generated above in Amos three times, each time setting

the reliability at a different level: .95, .80, or .65. This allows examining

how the results are influenced by the different levels of reliability. Reliabilities

were set by fixing the error variance of the latent variable using the formula:

$SD^2_{\text{scale}} \times (1 - \rho)$ (Brown, 2015).
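The formula above is simple enough to compute by hand, but a short sketch shows how the fixed error variance changes across the three reliability levels. This is a minimal Python illustration. The scale standard deviation of 1.2 is a made-up value, and `fixed_error_variance` is a hypothetical helper name, since neither is given in the chapter.

```python
def fixed_error_variance(scale_sd, reliability):
    """Error variance to fix for a single-indicator latent variable: SD^2 * (1 - rho)."""
    return scale_sd ** 2 * (1.0 - reliability)


# Hypothetical scale standard deviation (an assumption, not a value from the chapter).
scale_sd = 1.2

for rho in (0.95, 0.80, 0.65):
    print(rho, round(fixed_error_variance(scale_sd, rho), 3))
```

The lower the assumed reliability, the larger the share of the observed variance that is assigned to error rather than to the latent variable.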

FIGURE 16.3 Comparison of standardized structural coefficients at three levels of reliability: .95/.80/.65. Squares and ovals denote observed and unobserved variables, respectively. Note that this convention was not used in Figure 16.2 in order to avoid suggesting that latent variable modeling was performed there (see J. Cohen et al., 2003, p. 66, for a similar approach). DV = dependent variable, IV = independent variable, LV = latent variable.

The results are shown in Figure 16.3. When reliability is very high (.95),

the results are very similar to those from multiple regression (see Figure

16.2, correlation at .27), again demonstrating that multiple regression

assumes that the variables are measured with perfect reliability. However,

as the reliability decreases, the two sets of results diverge. The coefficient

increases in the case of LV1, but decreases in the case of LV2. Bollen (1989)

has demonstrated that in the case of correlation (or simple regression),

measurement error attenuates (or underestimates) the relationship, but in

the case of multiple regression the bias may be upward or downward (as in

Figure 16.3). With a reliability of .65, the coefficient for LV2 drops to .19

(recall that Table 16.1 reports a coefficient of .32!). A drop of this magnitude

would most likely have significant substantive consequences.
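The direction of these changes can be approximated without latent variable software by applying the classical correction for attenuation to the correlations in Table 16.1 and recomputing the standardized coefficients. The sketch below is a rough Python stand-in for the Amos models, not a reproduction of them. It assumes, as a simplification introduced here, that both IVs share the same reliability and that the DV is measured without error. The helper names `standardized_betas` and `disattenuate` are hypothetical.

```python
import math


def standardized_betas(r_1y, r_2y, r_12):
    """Standardized coefficients for two predictors from zero-order correlations."""
    denom = 1.0 - r_12 ** 2
    return (r_1y - r_2y * r_12) / denom, (r_2y - r_1y * r_12) / denom


def disattenuate(r_1y, r_2y, r_12, reliability):
    """Correct correlations for measurement error in the two IVs only.

    Assumption: both IVs have the same reliability and the DV is error-free,
    so IV-DV correlations are divided by sqrt(reliability) and the IV-IV
    correlation by the reliability itself.
    """
    root = math.sqrt(reliability)
    return r_1y / root, r_2y / root, r_12 / reliability


observed = (0.46, 0.32, 0.27)  # correlations from Table 16.1

results = {}
for reliability in (1.0, 0.95, 0.80, 0.65):
    corrected = disattenuate(*observed, reliability)
    results[reliability] = standardized_betas(*corrected)
    beta_1, beta_2 = results[reliability]
    print(reliability, round(beta_1, 2), round(beta_2, 2))
```

Under these assumptions the coefficient for the first predictor rises as reliability falls while the coefficient for the second predictor declines, reaching about .19 at a reliability of .65. This mirrors the pattern described for LV1 and LV2, although the exact values from a full latent variable model may differ.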

This example illustrates that reliability is more than merely a sign

telling the researcher whether the measure is good enough to proceed with

further analyses. Incorporating reliability can increase the precision of the

estimate, which sometimes leads to a dramatic change in the results. This

strategy works best when the measures are already reasonably reliable, and

so researchers should still aim to obtain measures with good reliability and

then adjust for the remaining unreliability. On the other hand, since they

do not account for reliability, correlation and regression risk obtaining

inaccurate results. In very early phases of a line of research, correlation

may be informative. As knowledge advances, greater use of latent variable

modeling would be more appropriate.

ANOVA and t-test

The same problem described above applies to ANOVA and t-test, in that

these procedures also assume that the variables are measured without error.

Again, measurement error can be accounted for through latent variable

modeling. The latent equivalent of t-test and ANOVA is multi-group

confirmatory factor analysis (MG-CFA; e.g., Brown, 2015). In this case,

instead of using the observed means for comparing groups, the latent means

are used after adjusting for measurement error. In the case of multi-item

scales, using the observed mean is justified only when each item contributes

equally to the latent mean, which again requires perfect reliability.

A second problem with t-test and ANOVA is that, unlike MG-CFA, they

do not allow testing for measurement invariance, that is, whether the two

(or more) groups interpret the items in a conceptually similar manner (Hiver

& Al-Hoorie, 2020a). For example, when comparing two groups, a mean

difference might be because one group actually has a higher latent score,

but it might also be because the two groups simply understand the items

differently. That the groups understand the items in a similar way is “a

logical prerequisite” (Vandenberg & Lance, 2000, p. 9) to any meaningful

interpretation of this difference. Items could be interpreted differently in

cross-cultural and cross-age comparisons. When measurement invariance

does not hold, it means that the measure needs refinement (see, e.g., Davidov,

Meuleman, Cieciuch, Schmidt, & Billiet, 2014). The same is true in cross-

time comparisons, as understanding of abstract notions might change over

time. The measurement invariance assumption can be investigated both in

classical test theory and in modern test theory (for accessible applications,

see Davidov, 2008; Engelhard, 2013; Sass, 2011; Steinmetz, Schmidt, Tina-

Booh, Wieczorek, & Schwartz, 2009).

Reliability

Last but not least, the increasing interest in CDST may encourage innovations

in instruments of measurement, which leads to a greater need to examine the

validity of scores derived from those new instruments. There is no reason to

assume that validity issues are irrelevant to CDST-inspired instruments, or

that issues of complexity and dynamic change nullify conventional notions

of validity (Hiver & Al-Hoorie, 2020b). Indeed, “the ability to use scores

for an intended research purpose is only justifiable if sufficient evidence

exists for the valid interpretation of the test scores” (Purpura, Brown, &

Schoonen, 2015, p. 41, original emphasis).

In the discussion above, we mentioned that many researchers use reliability

to determine whether their measures are good enough to proceed with the

analysis. One widespread rule of thumb is to recommend a reliability of .70.

The origin of this rule of thumb is that .70 accounts for 50 percent of the

variance (or more exactly $.70^2 = .49$). Clearly, accounting for only about 50

percent of the variance is not a remarkable feat. In fact, Lance, Butts, and

Michels (2006) describe this cutoff recommendation as an “urban legend,”

a myth that has persisted over the decades. According to psychometricians

Nunnally and Bernstein (1994), a reliability of .70 is considered

“only modest” (p. 265). Instead, acceptable reliability “should not be below

.80 for widely used scales” (Carmines & Zeller, 1979, p. 51).

Another important consideration is that even a reliability as high as .80

is suitable only for group-level analysis. However, CDST researchers are

typically interested in individual-level applications. A reliability of .80 would

introduce a high margin of error for individual-level applications. As Nunnally

and Bernstein explain,

We have noted that the standard error of measurement is almost one-third

as large as the overall standard deviation of test scores even when the

reliability is .90. If important decisions are made with respect to specific

test scores, a reliability of .90 is the bare minimum, and a reliability of .95

should be considered the desirable standard.

(Nunnally & Bernstein, 1994, p. 265)
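The "almost one-third" figure in this quotation follows from the classical test theory expression for the standard error of measurement, SD × √(1 − reliability). A minimal Python sketch of that relationship is given below. The function name `sem_ratio` is a hypothetical helper, and the reliability levels are those discussed in this section.

```python
import math


def sem_ratio(reliability):
    """Standard error of measurement as a fraction of the score SD."""
    return math.sqrt(1.0 - reliability)


for reliability in (0.70, 0.80, 0.90, 0.95):
    print(reliability, round(sem_ratio(reliability), 3))
```

At a reliability of .90 the ratio is about .32, which is the "almost one-third" in the quotation. At .80 and .70 the error band around an individual score is wider still, which is why such levels are problematic for individual-level interpretation.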

Another advantage of higher levels of reliability is the ability to split

samples into subgroups. It is not uncommon that researchers need to create

subgroups of their participants. The reliability of the instrument can decide

how many subgroups can be created. Table 16.2 presents a breakdown of the

relationship between reliability level and distinct subgroups (Fisher, 1992).

Table 16.2 Reliability levels and number of distinct subgroups

Reliability   Subgroups
.50           1
.70           2
.80           3
.90           4
.94           5
.96           7
.97           8
.98           9

Based on the level of reliability, therefore, the researcher can make an

informed and more accurate decision about how many subgroups (or strata)

can be meaningfully created. This, in turn, highlights the need for more

attention and effort to be focused on instrument construction and validation

in our field. Apparently, the myth that .70 reliability (or just below it) is

satisfactory has led to a sense of complacency and lack of interest in refining

instruments in our field.

Conclusion

Throughout this chapter, our message has been that quantitative methodology

should play a vital role in CDST research. It is high time we put to rest the

persistent myth that quantitative data elicitation and analyses are poorly

suited to CDST research: This is categorically not true, although there

are quantitative designs that are much more suited to investigating dynamic

change and interconnectedness than others (e.g., Molenaar, Lerner, &

Newell, 2014; Valsiner, Molenaar, Lyra, & Chaudhary, 2009). Furthermore,

it would be misleading to claim that qualitative designs are inherently more

compatible with CDST: They are not. From the perspective of CDST’s

philosophy of science, advocating a qualitative-only approach is neither

defensible nor pragmatic for the range of phenomena that necessitate

investigation (Hiver & Al-Hoorie, 2016, 2020b). Qualitative research

designs do not by themselves guarantee a more complex and dynamic

perspective for research, particularly if the research design is not inherently

connected to or informed by the conceptual framework of complexity

(Dörnyei et al., 2015). As others have noted, the selection of methods for

complexity-based inquiry in SLD does not suggest an either/or choice, and

we do not, of course, wish to understate the value of qualitative methods

that allow finely grained observations, having both reviewed and used some

of these methods ourselves firsthand.

Importantly, the potential of quantitative analyses for CDST research

extends past the mundane comparisons of groups and measurements of

linear relationships into the more compelling areas of identifying underlying

structure, accounting for variation at different levels, discerning temporal

processes and events, quantifying trends, predicting group membership,

applying spatial analysis, and studying networked phenomena nested in

contexts. We have written on these and other methods in book-length form

elsewhere (Hiver & Al-Hoorie, 2020b) and encourage interested readers

to explore these to the extent they feel comfortable with such innovative

methods. Provided that research applying a dynamic perspective ensures

that, from the earliest stages of design, there is no disconnect between the

conceptual framework of CDST and the statistical analyses used, there is

every reason to believe that quantitative methods will serve to advance and

accelerate our understanding of SLD.