Conference paper

Methodological and technological considerations in flipped language learning interventions: A systematic review



Flipped learning has become an important area of investigation in the second language field. We located 56 flipped learning interventions through a systematic search on databases including Scopus and ProQuest. Analysis of methodological and design features of these reports showed that almost half of them did not check the reliability of their outcome variables, that about 30% failed to include an empirical pre-test, and that no report conducted an a priori power analysis. After identifying these methodological and design issues, we offer guidance on how to address them. Our results also showed that all reports relied on technology to flip their classrooms. Most of these interventions (75%) used videos, and under half (41%) employed an interactive platform where students interacted with and/or through the technology. Using examples from the report pool, we then highlight how interactive and multimodal flipped applications might be most effective in light of recent theory, especially the drive to develop 21st century skills (Laar et al., 2020). Finally, we make suggestions for future research based on gaps in our report pool, such as more research on certain language outcomes, on languages other than English, and on younger learners.
Joseph P. Vitta, Rikkyo University, Japan
Ali H. Al-Hoorie
June 6, 2020
- Defining flipped learning
- Flipped learning effectiveness
- Systematic review
- Methodological features
- Power, reliability & use of pre-tests
- Review of technology use
- Videos & interactive technology use
- Gaps for future research
- 21st century skills
Flipped Learning as a Debated Construct
- Minimum definition: new content precedes class time in the form of homework/outside-of-class study
- Higher-order thinking and agency as the defining features of flipped learning
- Technology as the defining feature of flipped learning
Flipped as communicative language teaching?
- Webb and Doman (2016): flipped learning ≠ CLT
- Hung (2015) and Ishikawa et al. (2015) both compared flipped treatment groups with CLT comparison groups.
- Chen Hsieh et al. (2017) had students draft “the final dialog collaboratively” (p. 4) under the conventional learning condition.
Ideal Conditions for Flipped Learning (Mehring, 2018)
- Procedural and conceptual outcomes
- Motivated and trustworthy students
- Technological literacy on both the student and teacher ends
Flipped Learning Effectiveness
- Usually tested via experimental designs where flipped is a treatment compared with non-flipped learning conditions
- Past meta-analyses have found flipped groups perform better by a magnitude of 0.3 to 0.5 standard deviations (Cohen’s d) across different educational domains
- Humanities, subsuming second/foreign language learning, show slightly higher observed effects (e.g., Cheng et al., 2019, g [corrected d] = 0.63)
From our meta-analysis of L2 flipped interventions
- Magnitude benchmarks for L2 d (Plonsky & Oswald, 2014): 0.4 (small), 0.7 (medium), 1.0 (large)
- Overall effect g (corrected d) = 0.99, but it drops to 0.58 when adjusting for possible publication bias
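Since the slides report both d and its corrected value g, a minimal sketch of Hedges' small-sample correction (the standard d-to-g conversion; this is distinct from the publication-bias adjustment mentioned above, and the group sizes below are hypothetical) may be useful:

```python
def hedges_g(d: float, n1: int, n2: int) -> float:
    """Convert Cohen's d to Hedges' g via the small-sample
    correction J = 1 - 3 / (4 * df - 1), with df = n1 + n2 - 2."""
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))

# With two hypothetical groups of 48, the correction is tiny:
print(round(hedges_g(1.00, 48, 48), 3))  # 0.992
```

As the example shows, the correction only matters for small samples; for the sample sizes typical in this report pool, d and g are nearly identical.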
Effect sizes by L2 outcome (k = number of samples; outcome labels mostly lost in extraction):

L2 outcome                       k     g     95% CI lower   95% CI upper
(label lost)                    13    1.50       1.00           1.99
(label lost)                     4    1.42       0.62           2.21
(label lost)                     8    1.14       0.81           1.48
-skill (label truncated)        14    1.03       0.65           1.41
(label lost)                     5    1.01       0.38           1.63
(label lost)                     9    0.25       0.03           0.47
Standardized test performance    4    0.33       0.07           0.72
(label lost)                     3    1.25       0.09           2.59
From our meta-analysis, two concerning observations:
- Lack of details on the ‘flip’
- Much higher effects in non-SSCI journals (pointing to publication bias and methods issues); Scopus g = 1.39
We executed a systematic review in response.
Overview of Study
A systematic review is a research synthesis that identifies and quantifies trends in a report pool:
- RQ1: What are the observed methodological issues in L2 flipped experimental reports? (Power, reliability, use of pre-tests)
- RQ2: How do L2 flipped interventions employ technology in flipping the content? (Videos, Web 1.0 vs. Web 2.0)
- RQ3: What additional ‘gaps’ in relation to age, L2, and learning outcomes emerge from a review of L2 flipped experimental reports?
Judgements were validated via inter-rater reliability checking reported in the larger meta-analysis report: 85% to 88% agreement (.7 ≤ κ ≤ .86).
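The Cohen's κ statistic behind that inter-rater check can be computed directly from two raters' codings; a minimal sketch (the categories and ratings below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters coding the same items:
    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed
    agreement and p_e is chance agreement computed from each
    rater's marginal category proportions."""
    assert len(rater1) == len(rater2)
    n = len(rater1)
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[cat] * c2.get(cat, 0) for cat in c1) / n ** 2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical coding of 10 reports ('yes' = pre-test used)
r1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
r2 = ["yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]
print(round(cohens_kappa(r1, r2), 2))  # 0.78
```

Raw percentage agreement here is 90%, but κ = 0.78 after discounting chance agreement, which is why κ is reported alongside the agreement percentages above.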
The report pool
Many L2 meta-analyses and systematic review report pools limit their report search to a few journals, but this goes against the spirit of research synthesis endeavors, where comprehensiveness is king.
Our robust search yielded 56 experimental reports, more than double the L2 reports in either previous flipped review study.
RQ1. Power, i.e., minimum sample size
Finding: No report engaged in an a priori power analysis.
Why is this an issue? Power calculations tie research to the existing body of literature. Underpowered samples may be untrustworthy in both directions: 1) they can miss true but small effects in the population (Type II error) and 2) detected large effects could be flukes (Brysbaert, 2019).
How to fix it? Let’s assume d = 0.58 and calculate the required sample size with the G*Power software:
- 2 groups of 48 for the basic design that tests a flipped vs. non-flipped RQ
- 3 groups of 40 for a treatment, comparison, control design + ANOVA testing
- Values from other meta-analyses can be substituted
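These G*Power figures can also be reproduced in code; a sketch using statsmodels (assuming it is installed), taking Cohen's f for the three-group ANOVA as d / 2, a common rough conversion:

```python
import math
from statsmodels.stats.power import TTestIndPower, FTestAnovaPower

# Two-group design (flipped vs. non-flipped):
# d = 0.58, alpha = .05 two-tailed, target power = .80
n_per_group = TTestIndPower().solve_power(
    effect_size=0.58, alpha=0.05, power=0.80,
    ratio=1.0, alternative="two-sided")
print(math.ceil(n_per_group))  # 48 per group

# Three-group design (treatment / comparison / control) with
# one-way ANOVA; f = d / 2 = 0.29
n_total = FTestAnovaPower().solve_power(
    effect_size=0.29, alpha=0.05, power=0.80, k_groups=3)
print(math.ceil(n_total / 3))  # roughly 40 per group
```

Substituting a different effect-size estimate from another meta-analysis only requires changing `effect_size` in these calls.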
RQ1. Reliability
Finding: 53% of reports (30 of them) reported the reliability of the measurement of the outcome/dependent variable.
Why is it important? Without reliability, validity is impossible to satisfy, and so the study could be flawed. Validity of instruments, especially in L2, can be assumed across studies, BUT reliability must be checked in every instance (Al-Hoorie & Vitta, 2019).
How to fix it? Use inter-rater agreement when the outcome is a proficiency judgment, and Cronbach’s alpha (or KR-21) on test items. When using standardized tests that have been externally marked, clearly identify this in the report. Chen Hsieh et al. (2017) is a positive example.
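For item-level reliability, Cronbach's alpha is straightforward to compute from a respondents-by-items score matrix; a minimal sketch with invented 0/1 item scores:

```python
import numpy as np

def cronbach_alpha(items) -> float:
    """Cronbach's alpha for a (respondents x items) score matrix:
    alpha = k / (k - 1) * (1 - sum(item variances) / total variance)."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical 5-item test scored 0/1 for six learners
scores = np.array([
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
])
print(round(cronbach_alpha(scores), 2))  # 0.83
```

For dichotomous items like these, alpha reduces to KR-20, so the same function covers the test-item case discussed above.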
RQ1. Pre-test Usage
Finding: 39 reports (~70%) employed pre-tests to empirically demonstrate pre-treatment equivalence.
In all, this was a positive observation and speaks well of our field. While experimental designs in the strictest sense only require post-test comparisons, pre-treatment equivalence is important in L2 research as we’re pushing the boundaries of experimental research as it is, i.e., we’re not in a lab.
Curious example: a 2017 study (reading outcome) with g = 2.89.
RQ2. 100% of reports used technology
Video usage
- 75% (42 reports) used videos
- Video as the preferred medium of flipped applications
- Easy for teachers to make/control
- Accessibility of YouTube etc.
Interactive technology
- 41% (23 reports) used interactive technologies
- WhatsApp was a popular choice
- Learning management systems such as Edmodo
- Web 2.0 as the natural complement to flipped learning?
RQ3. Gaps Emerging within Report Pool
- 91% of reports (51 studies) saw English as the L2
- 75% of reports (42 studies) had samples of university language learners
- L2 outcomes involving competencies underpinning proficiency skills appear to be under-researched:
- Vocabulary: ~11% (6 studies)
- Grammar: ~9% (5 studies)
- Pronunciation: 0 studies
Implications for Researchers Going Forward
- There is room for improvement in flipped experimental designs, especially in relation to power analysis and psychometric checking.
- There is a need for future research to investigate the effects of flipped learning on non-university students and on learners of languages other than English.
- The positive and strong effects of flipped learning on procedural proficiency skills appear to be established, but flipped learning effects on outcomes such as grammar and vocabulary are still unclear.
Implications for Teachers Going Forward
- Overall, the research supports the view that flipped learning is effective in our field.
- Try to flip using interactive technology.
- Process/skill outcomes seem suitable for flipped learning.
- You can join the flipped learning academic discussion via frontline qualitative/action research into how flipped learning is working.
- 21st century skills provide a justification for using flipped learning in your classroom.
Al-Hoorie, A. H., & Vitta, J. P. (2019). The seven sins of L2 research: A review of 30 journals’ statistical quality
and their CiteScore, SJR, SNIP, JCR Impact Factors. Language Teaching Research, 23(6), 727-744.
Brysbaert, M. (2019). How many participants do we have to include in properly powered experiments? A
tutorial of power analysis with reference tables. Journal of Cognition, 2(1), 1-38.
Chen Hsieh, J. S., Wu, W.-C. V., & Marek, M. W. (2017). Using the flipped classroom to enhance EFL learning.
Computer Assisted Language Learning, 30(1-2), 1-21. doi:10.1080/09588221.2015.1111910
Cheng, L., Ritzhaupt, A. D., & Antonenko, P. (2019). Effects of the flipped classroom instructional
strategy on students’ learning outcomes: A meta-analysis. Educational Technology Research
and Development, 67(4), 793-824. doi:10.1007/s11423-018-9633-7
Hung, H.-T. (2015). Flipping the classroom for English language learners to foster active learning.
Computer Assisted Language Learning, 28(1), 81-96. doi:10.1080/09588221.2014.967701
Ishikawa, Y., Akahane-Yamada, R., Smith, C., Kondo, M., Tsubota, Y., & Dantsuji, M. (2015). An EFL
flipped learning course design: Utilizing students’ mobile online devices. In F. Helm, L. Bradley, M.
Guarda, & S. Thouësny (Eds.), Critical CALL Proceedings of the 2015 EUROCALL Conference, Padova,
Italy (pp. 261-267). Dublin, Ireland.
Mehring, J. (2018). The flipped classroom. In J. Mehring & A. Leis (Eds.), Innovations in flipping the
language classroom: Theories and practices (pp. 1-10). New York, NY: Springer.
Plonsky, L., & Oswald, F. L. (2014). How big Is “big”? Interpreting effect sizes in L2 research. Language
Learning, 64(4), 878-912. doi:10.1111/lang.12079
Webb, M., & Doman, E. (2016). Does the flipped classroom lead to increased gains on learning outcomes in
ESL/EFL contexts? CATESOL Journal, 28(1), 39-67.
Thank you!