The Journal of Asia TEFL
Vol. 18, No. 2, Summer 2021, 682-692
http://dx.doi.org/10.18823/asiatefl.2021.18.2.23.682
http://journal.asiatefl.org/
e-ISSN 2466-1511 © 2004 AsiaTEFL.org. All rights reserved.
Measurement and Sampling Recommendations for L2 Flipped
Learning Experiments: A Bottom-Up Methodological Synthesis
Joseph P. Vitta
Kyushu University, Fukuoka, Japan
Ali H. Al-Hoorie
Royal Commission for Jubail and Yanbu, Saudi Arabia
Introduction
With the advent of COVID-19, learning has transitioned from the classroom to online platforms (Tang
et al., 2020). Interestingly, this forced migration to online teaching has coincided with the rise of the
popularity of flipped learning in education in general (Låg & Sæle, 2019) and ELT and L2 in particular
(Mehring, 2018). Flipped learning, in a general sense, inverts the traditional learning paradigm by
presenting new content to students before and outside of the class with subsequent class time used for
interacting and engaging with said content (for an extensive consideration of L2 flipped learning, see Mehring, 2018). Unsurprisingly, recent research has investigated the potential of flipped classrooms
against the backdrop of online teaching in response to COVID-19. Tang and colleagues (2020), for
instance, investigated Chinese students’ perceptions of flipped learning in online environments vis-à-vis
traditional methods. It is likewise reasonable to expect L2 researchers to enhance their already substantial
interest in flipped learning in response to the current migration to online teaching. What is more, even after COVID-19 subsides, there is no reason to assume the L2 academic community’s interest in flipped learning will wane, as that interest had already been growing before the pandemic (see, e.g., Bonyadi, 2018).
This growing interest in flipped learning and related areas such as blended learning (see Mahmud,
2018) within L2 contexts underpins this current study, a methodological synthesis. We thus present a
focused systematic review of methods shortcomings in past L2 flipped learning experimental and quasi-
experimental (e.g., no random assignment) designs (‘experimental’ is used hereafter as a general term). In an earlier meta-analysis conducted by the researchers (Vitta & Al-Hoorie, 2020), several consequential and addressable methodological issues were observed that future inquiries could address to increase the rigor and trustworthiness of their findings. Given the growing focus L2 teachers and researchers are placing on flipped learning, highlighting such issues and observing their frequency is worthwhile because it provides a point of reference and a road map for improving future L2 flipped research. As L2 quantitative inquiries are currently undergoing a state of methodological reform (Gass, Loewen, & Plonsky, 2020), we must state clearly that we are not criticizing past inquiries; rather, this inquiry was conducted with the intention of improving future research into the effectiveness of flipped learning.
Selection of Methodological Issues
The methodological issues were selected via a bottom-up process during the coding of effects
presented in the earlier meta-analysis (Vitta & Al-Hoorie, 2020). To be selected, an issue had to be both
consequential and addressable, with the researchers observing the issue among several (as opposed to
only a few) reports in the pool. Failure to report effect sizes was omitted from this review, for instance,
because most reports, especially those published recently, had overtly reported them. Consequential
implies that not addressing the issue could bias the results and/or undermine the trust that readers would place in the
results. Effect size reporting would exemplify a consequential issue given the call for researchers to report
effect sizes even when nonsignificant (see Al-Hoorie & Vitta, 2019). Conversely, checking inter-rater
reliability with a simple percentage of agreement instead of Cohen’s kappa would be classified as
inconsequential because the former metric still respects the expectation that a subsequent independent
coding (McHugh, 2012) must establish reliability for nominal and/or ranked judgements. The recent call
for L2 experimental designs to replace fixed-effect models (e.g., ANOVA) with mixed-effect models (see
Linck & Cunnings, 2015) provides an example of a less easily addressable issue for the average L2
researcher currently. While use of these models is growing in the field, they require specialized software, and standard research handbooks and corresponding tools (e.g., SPSS; Field, 2018) still mainly support and detail fixed-effect models; we therefore do not discuss this point further in this article. At the end of this process, five issues, subsumed under measurement and sampling,
emerged as being somewhat widespread among the reports, consequential, and addressable:
Measurement Recommendations
1) assessing the reliability of dependent variable measurements
2) administering pretests
Sampling Recommendations
3) conducting a priori power analysis
4) describing the randomized assignment procedure explicitly
5) using multi-site samples
The consequence of the selected issues becomes clear when reviewing relevant literature, especially L2
research syntheses of quantitative inquiries (e.g., Plonsky, 2013, 2014). In what follows we review each
recommendation in turn against past considerations.
Reliability
As stated in the psychometric literature, reliability is a prerequisite of validity; in other words, measurements that are not reliable cannot be valid. In guidance for L2 researchers, Al-Hoorie and Vitta (2019) accordingly argued that reliability, even of trusted instruments, must always be reported. It is unsurprising that reliability has been of interest to L2 methodological syntheses, with divergent findings observed. Plonsky and Gass (2011) observed 64% of reports reporting reliability while Plonsky (2014) observed 50% in reports published in the 2000s. The different scopes and foci of the inquiries most likely account for such discrepancies.
Pretest Use
As highlighted by methodologists (e.g., Kuehl, 2000), experiments in the strictest sense do not have to
include a pretest, a measurement of the outcome/dependent variable before the treatment. Recent L2
experimental guidance (Rogers & Révész, 2020), however, has conceptualized pretests as vital to
establishing pre-treatment equivalency among the experimental groups. This makes sense as L2 learners
will most likely come into classrooms with varying experiences learning the target language. L2
methodological syntheses have observed that experimental designs for the most part tend to use pre-tests.
Farsani and Babaii (2020) found that 90% of experimental designs in graduate theses included pre-tests
while Plonsky (2013) observed 67% pretest use among reports located in major L2 journals.
Power
Statistical power refers to having a large enough sample to detect a selected effect (Brysbaert, 2019). In
other words, powered samples are planned around the effect size, i.e., the quantification of the effect that the researcher intends to detect. Power is related to Type II error, the probability of failing to detect an effect in the sample when that effect does exist in the population (a false negative). The more power a sample has, the lower its chance of missing such true effects, i.e., of committing a Type II error. In simpler terms, all else being equal, having more participants equates to higher power. While post-hoc power procedures do exist (telling the reader the observed power of a finding, assuming it actually exists in the population), conducting a power analysis before the research that references relevant effects (a priori power analysis) is considered the gold standard (Aberson, 2019). Power has traditionally been viewed as belonging to the frequentist domain, but power-like analysis exists in the Bayesian realm, with Kruschke (2015) describing it as “the probability of achieving the goal of a planned empirical study, if a suspected underlying state of the world is true” (p. 359). Other sample planning processes, such as precision (Cumming, 2012), which determines the sample size required to sufficiently narrow the confidence interval for a given effect, also require the a priori selection of a certain effect size. The overarching point is that researchers must plan their sample size referencing past effect sizes.
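To make the a priori logic concrete, the following minimal sketch (our illustration, not drawn from any study in the report pool) uses Python’s statsmodels to solve for the per-group sample size needed to detect Plonsky and Oswald’s (2014) field median between-subjects effect of d = 0.70 at α = .05 and 80% power; the output should closely match dedicated tools such as G*Power.

```python
# A minimal a priori power sketch (illustrative assumptions: d = 0.70, alpha = .05, power = .80).
from statsmodels.stats.power import TTestIndPower

effect_size = 0.70   # Cohen's d chosen a priori from past research
alpha = 0.05         # two-tailed Type I error rate
power = 0.80         # 1 - beta (i.e., Type II error rate of .20)

analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=effect_size, alpha=alpha,
                                   power=power, ratio=1.0,
                                   alternative='two-sided')
print(f"Participants needed per group: {n_per_group:.1f}")  # roughly 33; round up when planning
```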
When such sample planning is omitted, as highlighted by Brysbaert (2019), findings become of little
interpretive value vis-à-vis uncovering effects existing in the population. Underpowered studies (including pilot studies) that reject small effects may simply have had a sample too small to detect a small but true effect existing in the population. In a similar vein, large effects in underpowered studies might be ‘flukes’ unrepresentative of the true parameter(s) within the population (see Brysbaert, 2019). Both past (Plonsky, 2013) and recent (Farsani & Babaii, 2020) L2 research syntheses have unfortunately observed that few studies (less than 15%) consider power in any fashion, with a priori versus other power analyses not always clearly catalogued.
Randomized Assignment
Randomized (or random) assignment is one side of the randomization process which one expects to
find in experimental designs, with the other being randomized sampling or randomly drawing participants
from the population (Kuehl, 2000). Random assignment refers to assigning experimental conditions
randomly to participants; the gold standard for this process is at the participant level (Rogers & Révész,
2020). When ‘writing up’ experiments in reports, there is a need to explicitly detail the random
assignment process given its importance to experimental designs. L2 researchers, however, are sometimes
precluded from implementing participant-level random assignment as they perform research on pre-
existing classes/groups (Farsani & Babaii, 2020). With classroom-orientated research areas such as
flipped learning, random assignment is often only possible at the class level (e.g., Hung, 2015). Plonsky (2013) and Farsani and Babaii (2020) therefore coded random assignment using the person- versus class/group-level distinction. Unlike power, the field has generally adopted random assignment in its experimental design processes; Farsani and Babaii observed 58% randomized assignment use, subsuming both levels, within their report pool, which corresponds with the 47% observed earlier by Plonsky (2013).
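As an illustration of the participant- versus class-level distinction discussed above, the short sketch below (hypothetical learner and class identifiers, not taken from any report in the pool) shows how conditions could be assigned by chance at either level.

```python
# Illustrative random assignment at the participant level and at the class level.
import random

random.seed(2021)  # fix the seed so the assignment is reproducible and reportable

# Participant-level assignment (gold standard): each learner is assigned by chance.
participants = [f"S{i:03d}" for i in range(1, 61)]           # 60 hypothetical learners
shuffled = random.sample(participants, k=len(participants))  # random order
assignment = {s: ("flipped" if i % 2 == 0 else "traditional")
              for i, s in enumerate(shuffled)}

# Class-level assignment (common in flipped learning research on intact classes).
classes = ["Class A", "Class B", "Class C", "Class D"]       # hypothetical intact classes
flipped_classes = set(random.sample(classes, k=2))           # half the classes, chosen by chance
class_assignment = {c: ("flipped" if c in flipped_classes else "traditional")
                    for c in classes}

print(class_assignment)
```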
Multi-site Use
Randomized sampling, subsuming other probability sampling approaches such as stratified sampling (see Harter, 2008), is expensive and labor intensive and perhaps impractical for L2 researchers. Recent examples of such sampling procedures (e.g., Hiver & Al-Hoorie, 2020) feature sample sizes over 1,000 drawn from various regions within the country and/or intended population. It is therefore unsurprising that recent L2
methods guidance has either emphasized randomized assignment (e.g., Rogers & Révész, 2020) or
mentioned randomization in a general sense (e.g., Gass et al., 2020). Many L2 researchers most likely are
practically constrained from constructing randomized samples.
If randomized sampling is at one end of the continuum and perhaps impractical in relation to selecting
participants for a sample, then single-site convenience samples are at the other end. Imagine that one
randomly selected 200 Chinese high school EFL students from a national database. The probability that all of them would be from the same region, let alone the same school, is for all practical purposes zero. It
is therefore unsurprising that there has been a recent call to replicate past L2 studies using multi-site
samples (see Morgan-Short et al., 2018). Multi-site samples, while still probably being convenience
samples conceptually, create a “random(effect)-by site” (Morgan-Short et al., 2018, p. 408) which can
allow for a better generalization to meaningful populations than single-site samples assuming the sites’
effect is minimal vis-à-vis the fixed experimental effects. While recent L2 syntheses such as Farsani and Babaii (2020) did investigate randomized (probability) versus convenience sampling, it appears that multi- versus single-site sampling has yet to be overtly explored within L2 research syntheses.
Present Study
In the present study, we reviewed a report pool of L2 flipped experimental (full- and quasi-
experimental) designs in relation to the measurement and sampling issues that were selected as being both
consequential and addressable. The research question governing the present study is:
To what extent do L2 flipped experimental designs: a) report reliability, b) use pretests, c) consider
power, d) employ randomized assignment, and e) use multi-site samples?
As highlighted above, these issues have been the focus of past research syntheses (e.g., Plonsky, 2014)
or been recently brought to the attention of L2 researchers (e.g., multi-site use; Morgan-Short et al., 2018).
The rationale of our investigation was to highlight areas that could be improved upon by future
researchers while also providing empirical data on how satisfactory L2 flipped learning experimental
designs appear to be in the areas investigated.
Report Pool Creation
The report pool utilized in this study was first created for a meta-analysis of experimental designs
investigating L2 flipped learning interventions. Following the conventions for a robust literature search
process to mitigate selection bias, multiple databases, indexes, and search engines were employed to
review over 50,000 reports, with over 8,000 papers also manually inspected against our search criteria
(see Vitta & Al-Hoorie, 2020). The researchers also published a call for papers to capture unpublished
reports.
The process terminated in August 2019 (with no beginning time constraint) and 56 reports were
selected. Of these, 45 were journal articles and 11 were either conference proceedings (k = 4) or theses and dissertations (k = 7). Our report pool total was larger than that of recent research syntheses such as Låg and Sæle (2019), who located 23 L2 flipped experimental designs.
Operationalization, Coding and Results
The five areas of inquiry driving this investigation and reflected in the governing research question
were operationalized into dichotomous judgments (items) located in Table 1. A ‘yes’ response meant that
the report was satisfactory; a ‘no’ meant that it was not. For some judgments, further explanation on the
yes/no delineation has been present as ‘additional notes.’ With the exception of power, one item or
judgment operationalized each issue investigated. Since other forms of power analysis besides a priori,
which is the gold standard (Brysbaert, 2019), have been considered useful by some (Aberson, 2019), Item
3a was included to capture such iterations of power analysis.
For each report, judgments were made for the items found in Table 1. After this initial coding, a second
researcher coded 25% of the report pool (k = 14). Because of the overwhelming instances of ‘no’
judgements for some items, inferential testing of judgements, e.g., Cohen’s kappa, was not possible
(McHugh, 2012). Instead, raw percentage of initial agreement was employed and ranged from 71.42% to
100% across the six items as highlighted in Table 1 with an average agreement of 88.09% (SD = 11.67%)
observed. Differences were resolved via discussion.
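For transparency, the arithmetic behind these agreement figures can be reproduced as follows; this sketch assumes the six item-level percentages reported in Table 1 and that each item was double-coded for the k = 14 reports.

```python
# Reproducing the inter-rater agreement summary (item-level values taken from Table 1).
import statistics

# Raw percentage agreement for one item: matching judgments / double-coded reports.
matches, double_coded = 11, 14
print(round(matches / double_coded * 100, 2))  # 78.57 (Item 1)

item_agreements = [78.57, 85.71, 100.0, 100.0, 71.42, 92.85]  # Items 1, 2, 3a, 3b, 4, 5
print(round(statistics.mean(item_agreements), 2))   # 88.09
print(round(statistics.stdev(item_agreements), 2))  # 11.67 (sample SD)
```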
TABLE 1
Judgments of Selected Measurement and Sampling Design Issues

Item 1. Did the study report the reliability of the L2 outcome variable(s)’ measurements?
  Additional notes: Coded ‘yes’ if a standardized test was employed as outcome variable or a published paper of said exam was used. Coded ‘no’ if qualitative validation process, e.g., peer/colleague review, of test measuring classroom outcomes such as targeted vocabulary items. Marked ‘no’ if reliability was reported for some but not all L2 outcomes.
  Inter-rater reliability: 78.57%

Item 2. Was a pretest employed to establish pre-treatment equivalency among the groups/experimental conditions in relation to the outcome variable?
  Additional notes: Coded ‘no’ if pre-treatment measurement was not measuring the same L2 outcome as the posttest. Coded ‘no’ if pretest was not fully reported. Coded ‘yes’ if measurement was near start of treatment and demonstrated equivalency.
  Inter-rater reliability: 85.71%

Item 3a. Was power considered in the study in any fashion?§
  Additional notes: Included post-hoc, compromise, etc.
  Inter-rater reliability: 100%

Item 3b. Was a priori power analysis conducted referencing relevant effect sizes?
  Additional notes: Had to be reported as an a priori process referencing relevant L2-centric effect sizes.
  Inter-rater reliability: 100%

Item 4. Did the study engage in random assignment at either the participant or class/group level?
  Additional notes: Coded ‘yes’ only if there was explicit description/labeling of the random assignment process. Text had to explicitly state chance was involved in assigning conditions. Some reports conflated group- and participant-level assignment by mislabeling what was clearly group-level assignment as participant-level.
  Inter-rater reliability: 71.42% (before explicit clarification)

Item 5. Was the sample multi-site?
  Additional notes: Coded ‘no’ when report was ambiguous.
  Inter-rater reliability: 92.85%
Note 1. It may be unreasonable to expect all classroom researchers to obtain reliability data for such test data. Past papers, likewise, have been externally validated, but in the strictest sense, future research should check instrument reliability.
Note 2. No study employed Bayesian analyses or mentioned sample planning procedures such as precision (Cumming, 2012). Accordingly, power relates to NHST-centric analysis here.
As highlighted in Figure 1, the investigated methodological issues varied in their frequency of occurrence. In sum, L2 flipped experimental reports appeared to be more satisfactory on the measurement issues than on the sampling issues. Considering the former, about 70% of the reports (Item 2: k = 39) employed a
pretest and half of the reports (Item 1: k = 28) reported reliability according to the standards presented in
Table 1. Regarding sampling, only about 36% (Item 4: k = 20) and about 18% (Item 5: k = 10) of the
reports featured randomized assignment and multi-site use, respectively. No report (Item 3b) employed
the gold standard of a priori power analysis.
Figure 1. Frequency of Measurement and Sampling Issues in L2 Flipped Experimental Reports.
Discussion
The implications of the current study’s findings are two-fold. First, L2 flipped experimental designs
report reliability and use pretests with a similar frequency observed in other L2 research syntheses.
Second, there is room to improve sampling designs in future L2 flipped experiments. We discuss each in
turn.
Reliability and Pretest Use
The frequency of reliability checking (50%, k = 28) was within the range observed in past methodological syntheses. Consider, for instance, that Plonsky and Gass (2011) observed 64% of reports reporting reliability while Plonsky (2014) observed that 50% of reports published in the 2000s had also done so. Reports in the pool can therefore be referenced for appropriate practice. When multiple-choice tests or other closed-ended instruments are employed to measure the outcome variable, a reliability metric such as Cronbach’s alpha or KR-21 is needed to demonstrate the observed internal consistency of
the measurement. Al-Harbi and Alshumaimeri (2016) provided such information when reporting the
reliability of their grammar measurement. When the measurement is derived from rater (human)
judgements, inter-rater reliability is the standard to establish the reliability of the judgments. Bonyadi
(2018) provides a positive example of this process in validating judgments of oral performance. When
there is approximate balance among the possible judgements, inferential testing and corresponding
metrics such as Cohen’s kappa (κ) can be used to correct for the contribution of chance to the observed
percentage of agreement.
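To illustrate the two types of reliability evidence discussed here, the sketch below computes Cronbach’s alpha for a hypothetical closed-ended test and Cohen’s kappa for two hypothetical raters; the data are invented purely for demonstration and the formulas follow the standard definitions rather than any specific report in the pool.

```python
# Illustrative reliability computations on invented data.
import numpy as np

# Cronbach's alpha for a closed-ended test: rows = test takers, columns = items (1 = correct).
scores = np.array([[1, 1, 0, 1, 1],
                   [1, 0, 0, 1, 0],
                   [1, 1, 1, 1, 1],
                   [0, 0, 0, 1, 0],
                   [1, 1, 1, 0, 1],
                   [0, 1, 0, 0, 0]])
k = scores.shape[1]
item_vars = scores.var(axis=0, ddof=1).sum()   # sum of item variances
total_var = scores.sum(axis=1).var(ddof=1)     # variance of the total scores
alpha = (k / (k - 1)) * (1 - item_vars / total_var)
print(f"Cronbach's alpha: {alpha:.2f}")

# Cohen's kappa for two raters' nominal judgments of the same performances.
rater1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass"]
rater2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass"]
categories = sorted(set(rater1) | set(rater2))
n = len(rater1)
p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
p_chance = sum((rater1.count(c) / n) * (rater2.count(c) / n) for c in categories)
kappa = (p_observed - p_chance) / (1 - p_chance)
print(f"Percent agreement: {p_observed:.2%}, Cohen's kappa: {kappa:.2f}")
```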
As with reliability, 39 reports (around 70%) included a pretest, which corresponds to observations such as Plonsky’s (2013) 67%. Pretest use is somewhat straightforward compared to reliability, but future research should be mindful of how pretests can affect inferential testing. A handful of reports employed ANCOVA (see Field, 2018) where the outcome (dependent) variable was either a gain score (posttest − pretest) or the posttest and the covariate was the pretest. Future research should be mindful, however, that such designs (pretest as predictor/covariate) have been observed to create issues in relation to model estimates (Lord’s paradox; see Allison, 1990, p. 96), with gain scores alone as the dependent variable perhaps yielding more accurate results in fixed-effect models.
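The two modeling options mentioned above can be contrasted directly. The sketch below uses simulated data (not drawn from any report in the pool) to fit an ANCOVA-style model with the pretest as covariate and a gain-score model, so researchers can see how the two specifications differ.

```python
# Contrasting an ANCOVA-style model (pretest as covariate) with a gain-score model on simulated data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 60
group = np.repeat(["flipped", "traditional"], n // 2)
pretest = rng.normal(50, 10, n)
treatment_effect = np.where(group == "flipped", 5, 0)         # assumed true effect for the simulation
posttest = 0.8 * pretest + treatment_effect + rng.normal(0, 8, n)
df = pd.DataFrame({"group": group, "pretest": pretest,
                   "posttest": posttest, "gain": posttest - pretest})

ancova = smf.ols("posttest ~ pretest + C(group)", data=df).fit()
gain_model = smf.ols("gain ~ C(group)", data=df).fit()

print(ancova.params)      # adjusted group difference controlling for the pretest
print(gain_model.params)  # group difference in gain scores
```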
Reforming and Improving Sampling Issues
Power considerations were almost completely ignored within our report pool and this corresponded to
past L2 methodological syntheses (e.g., Plonsky, 2013). As highlighted by Brysbaert (2019), a priori power analysis is the gold standard for effect size-driven sample planning and requires a pre-determined effect size to begin the process. In the larger study preceding this report, the researchers (Vitta & Al-Hoorie, 2020) observed a
(corrected for publication bias) aggregated effect of g = 0.58 where flipped interventions were more
effective than non-flipped comparisons. This value corresponded to Plonsky and Oswald’s (2014) median
between-subject effect of d = 0.70 in group comparisons in L2 research. Calculating power from g = 0.58
for a 3-group design (e.g., treatment, comparison, control) shows that researchers need 120 participants
with 40 per group (calculated with G*Power; Faul, Erdfelder, Lang, & Buchner, 2007; see Appendix A).
Post-hoc comparisons would have to be executed using tests such as Tukey or Games-Howell to maintain
significance with this sample size. A two-group design requires two groups of 48 participants (N = 96).
Should researchers wish to be conservative in their power calculations, the aggregated effect from an
overarching review of education research (d = 0.40; see Hattie, 2009) could be employed as opposed to g
= 0.58.
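Readers without G*Power can approximate the sample sizes cited above as follows. This is our own illustrative sketch using Python’s statsmodels: it converts g = 0.58 to eta-squared (as in Appendix A) and then to Cohen’s f (an added step, not in the original calculations), and its output should be close to the G*Power figures of roughly 120 participants for the three-group design and 48 per group for the two-group design.

```python
# Approximating the Appendix A power calculations (g = 0.58, alpha = .05, power = .80).
from statsmodels.stats.power import FTestAnovaPower, TTestIndPower

g = 0.58
eta_squared = g**2 / (g**2 + 4)               # d-family to eta-squared conversion
f = (eta_squared / (1 - eta_squared)) ** 0.5  # Cohen's f for the ANOVA power routine

n_total_anova = FTestAnovaPower().solve_power(effect_size=f, alpha=0.05,
                                              power=0.80, k_groups=3)
n_per_group_t = TTestIndPower().solve_power(effect_size=g, alpha=0.05,
                                            power=0.80, alternative='two-sided')

print(f"eta-squared: {eta_squared:.4f}, Cohen's f: {f:.3f}")
print(f"Three-group ANOVA, total N: {n_total_anova:.0f}")    # approximately 115-120 in total
print(f"Two-group t-test, n per group: {n_per_group_t:.0f}") # approximately 48 per group
```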
Random assignment and multi-site use can work in tandem to enhance the generalizability of the samples found in future L2 flipped experimental designs. As highlighted above, random assignment was observed in about a third of the report pool, which is somewhat lower than observations in past research syntheses, e.g., the 58% observed by Farsani and Babaii (2020). Nevertheless, 20 reports did implement this
feature and these can act as positive examples. Hung (2015) for instance provides a clear rationale for and
description of random assignment at the class level. To be able to better utilize multi-site samples,
researchers could collaborate with each other to construct samples with multiple locations reflective of
the intended population. Overt descriptions of such collaboration were missing from the report pool, but consider as an example vocabulary research within an Asian TEFL context (Japan), which, like flipped interventions, is often investigated at the classroom level. McLean, Kramer, and Beglar (2015) presented a multi-authored report where the co-authors facilitated different sites from which to construct their sample. One way to improve on this approach would be to add a degree of randomness, as sketched below. Assume that
six researchers provide access to six Vietnamese universities. After a priori power calculations, the
researchers could select the required sites (from the six available universities) to meet the power-
determined sample size threshold. Alternatively, participants could be drawn randomly from the six sites
to attain an appropriately powered sample.
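The site-selection idea described above could be implemented along the following lines; the six university names are placeholders and the per-site participant pools are invented purely for illustration.

```python
# Illustrative random selection of sites and participants for a multi-site sample.
import random

random.seed(42)

# Hypothetical pools of available participants at six collaborating sites.
sites = {f"University {letter}": [f"{letter}{i:03d}" for i in range(1, 81)]
         for letter in "ABCDEF"}

target_n = 120  # sample size from an a priori power analysis (e.g., the three-group design above)

# Option 1: randomly select whole sites until the powered sample size is met.
chosen_sites = []
for site in random.sample(list(sites), k=len(sites)):
    chosen_sites.append(site)
    if sum(len(sites[s]) for s in chosen_sites) >= target_n:
        break

# Option 2: draw participants randomly across all six sites.
all_participants = [p for pool in sites.values() for p in pool]
random_sample = random.sample(all_participants, k=target_n)

print(chosen_sites, len(random_sample))
```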
Conclusion
In this focused methodological synthesis, 56 L2 flipped experimental designs were analyzed in relation
to their reporting of reliability, pretest use, power considerations, use of random assignment and multi-
site samples. Taken together, the designs seemed adequate in handling reliability and including pretests,
but there is room for improvement in future L2 flipped research. Power considerations and multi-site
samples were largely missing, and so future flipped research could consider these issues in order to
enhance the generalizability of findings.
Data Availability: Data sheet and report pool bibliography, available upon request to corresponding
author.
Acknowledgements
The authors acknowledge Dr. Jeffery Mehring and Christopher Nicklin for their assistance with this
project.
The Authors
Joseph P. Vitta (corresponding author) is an Associate Professor at Kyushu University, Fukuoka, Japan.
Kyushu University Faculty of Languages and Cultures
744 Motooka, Nishi-ku, Fukuoka-shi, Fukuoka-ken, 819-0395
Tel: +81 092-802-2125
Email: vittajp@flc.kyushu-u.ac.jp
ORCID: 0000-0002-5711-969X
Ali H. Al-Hoorie is an assistant professor at the Jubail English Language and Preparatory Year Institute,
Royal Commission for Jubail and Yanbu, Saudi Arabia.
Jubail English Language and Preparatory Year Institute
Royal Commission for Jubail and Yanbu
Jubail Industrial City 31961
Saudi Arabia
Email: hoorie_a@jic.edu.sa
ORCID: 0000-0003-3810-5978
References
Aberson, C. L. (2019). Applied power analysis for the behavioral sciences. Routledge.
Al-Harbi, S. S., & Alshumaimeri, Y. A. (2016). The flipped classroom impact in grammar class on EFL
Saudi secondary school students’ performances and attitudes. English Language Teaching, 9(10),
6080.
Al-Hoorie, A. H., & Vitta, J. P. (2019). The seven sins of L2 research: A review of 30 journals’ statistical
quality and their CiteScore, SJR, SNIP, JCR Impact Factors. Language Teaching Research, 23(6),
727744. https://doi.org/10.1177/136216881876719
Allison, P. D. (1990). Change scores as dependent variables in regression analysis. Sociological
Methodology, 20, 93114.
Bonyadi, A. (2018). The effects of flipped instruction on Iranian EFL students’ oral interpretation
performance. The Journal of Asia TEFL, 15(4), 11461155.
Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A
tutorial of power analysis with reference tables. Journal of Cognition, 2(1), 138. https://doi.org/
10.5334/joc.72
Cumming, G. (2012). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis. Routledge.
Farsani, M. A., & Babaii, E. (2020). Applied linguistics research in three decades: A methodological synthesis of graduate theses in an EFL context. Quality & Quantity, 54(4), 1257–1283. https://doi.org/10.1007/s11135-020-00984-w
Faul, F., Erdfelder, E., Lang, A. G., & Buchner, A. (2007). G*Power 3: A flexible statistical power analysis program for the social, behavioral, and biomedical sciences. Behavior Research Methods, 39(2), 175–191. https://doi.org/10.3758/BF03193146
Field, A. (2018). Discovering statistics using IBM SPSS statistics. Sage.
Gass, S., Loewen, S., & Plonsky, L. (2020). Coming of age: The past, present, and future of quantitative SLA research. Language Teaching, 1–14. https://doi.org/10.1017/S0261444819000430
Harter, R. (2008). Random sampling. In P. Lavrakas (Ed.), Encyclopedia of survey research methods (pp. 683–684). Sage.
Hattie, J. A. C. (2009). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge.
Hiver, P., & Al-Hoorie, A. H. (2020). Reexamining the role of vision in second language motivation: A preregistered conceptual replication of You, Dörnyei, and Csizér (2016). Language Learning, 70(1), 48–102. https://doi.org/10.1111/lang.12371
Hung, H.-T. (2015). Flipping the classroom for English language learners to foster active learning. Computer Assisted Language Learning, 28(1), 81–96.
Kruschke, J. K. (2015). Doing Bayesian data analysis: A tutorial with R and BUGS (2nd ed.). Academic Press.
Kuehl, R. O. (2000). Design of experiments: Statistical principles in research design and analysis. Duxbury.
Låg, T., & Sæle, R. G. (2019). Does the flipped classroom improve student learning and satisfaction? A systematic review and meta-analysis. AERA Open, 5(3), 1–17. https://doi.org/10.1177/2332858419870489
Linck, J. A., & Cunnings, I. (2015). The utility and application of mixed-effects models in second language research. Language Learning, 65(S1), 185–207. https://doi.org/10.1111/lang.12117
Mahmud, M. M. (2018). Technology and language – What works and what does not: A meta-analysis of blended learning research. The Journal of Asia TEFL, 15(2), 365–382. https://doi.org/10.18823/asiatefl.2018.15.2.7.365
McHugh, M. L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica, 22(3), 276–282. https://doi.org/10.11613/bm.2012.031
McLean, S., Kramer, B., & Beglar, D. (2015). The creation and validation of a listening vocabulary levels test. Language Teaching Research, 19(6), 741–760. https://doi.org/10.1177/1362168814567889
Mehring, J. (2018). The flipped classroom. In J. Mehring & A. Leis (Eds.), Innovations in flipping the language classroom: Theories and practices (pp. 1–10). Springer.
Morgan-Short, K., Marsden, E., Heil, J., Issa II, B. I., Leow, R. P., Mikhaylova, A., Mikołajczak, S., Moreno, N., Slabakova, R., & Szudarski, P. (2018). Multisite replication in second language acquisition research: Attention to form during listening and reading comprehension. Language Learning, 68(2), 392–437. https://doi.org/10.1111/lang.12292
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting practices in quantitative L2 research. Studies in Second Language Acquisition, 35(4), 655–687. https://doi.org/10.1017/S0272263113000399
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodological synthesis and call for reform. The Modern Language Journal, 98, 450–470. https://doi.org/10.1111/j.1540-4781.2014.12058.x
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The case of interaction research. Language Learning, 61(2), 325–366. https://doi.org/10.1111/j.1467-9922.2011.00640.x
Plonsky, L., & Oswald, F. L. (2014). How big is “big”? Interpreting effect sizes in L2 research. Language Learning, 64(4), 878–912. https://doi.org/10.1111/lang.12079
Rogers, J., & Révész, A. (2020). Experimental and quasi-experimental designs. In J. McKinley & H. Rose (Eds.), The Routledge handbook of research methods in applied linguistics (pp. 133–143). Routledge.
Tang, T., Abuhmaid, A. M., Olaimat, M., Oudat, D. M., Aldhaeebi, M., & Bamanger, E. (2020). Efficiency of flipped classroom with online-based teaching under COVID-19. Interactive Learning Environments, 1–12. https://doi.org/10.1080/10494820.2020.1817761
Vitta, J. P., & Al-Hoorie, A. H. (2020). The flipped classroom in second language learning: A meta-analysis. Language Teaching Research. Advance online publication. https://doi.org/10.1177/1362168820981403
(Received March 10, 2021; Revised May 20, 2021; Accepted June 18, 2021)
Appendix A
Power Calculations
g = 0.58 equates to eta-squared = .0776, where eta-squared = d² / (d² + 4), extrapolated from the formula in Brysbaert (2019).
[G*Power output: three-group one-way ANOVA]
[G*Power output: two-group t-test]