Language Teaching Research
2019, Vol. 23(6) 727–744
© The Author(s) 2018
DOI: 10.1177/1362168818767191
journals.sagepub.com/home/ltr
The seven sins of L2 research:
A review of 30 journals’
statistical quality and their
CiteScore, SJR, SNIP, JCR
Impact Factors
Ali H. Al-Hoorie
Jubail Industrial College, Saudi Arabia
Joseph P. Vitta
Queen’s University Belfast, UK; Rikkyo University – College of Intercultural Communication, Japan
Abstract
This report presents a review of the statistical practices of 30 journals representative of the
second language field. A review of 150 articles showed a number of prevalent statistical violations
including incomplete reporting of reliability, validity, non-significant results, effect sizes, and
assumption checks as well as making inferences from descriptive statistics and failing to correct
for multiple comparisons. Scopus citation analysis metrics and whether a journal is SSCI-indexed
were predictors of journal statistical quality. No clear evidence was obtained to favor the newly
introduced CiteScore over SNIP or SJR. Implications of the results are discussed.
Keywords
citation analysis metrics, CiteScore, JCR Impact Factor, journal quality, quantitative research,
second language, SJR, SNIP
I Introduction
Second language (L2) researchers have long been interested in improving the quantita-
tive rigor in the field. In an early study, Brown (1990) pointed out the need to improve
quantitative quality in the field, singling out the importance of using ANOVA as opposed
to multiple t-tests. This recommendation may now be common knowledge to many
researchers, pointing to the fact that our field has made substantial progress in statistical
practices over the decades (see Plonsky, 2014). The field has now moved to relatively
more advanced topics, including the need for a priori power calculation and assumption
checking (e.g. Larson-Hall, 2016; Norris et al., 2015; Plonsky, 2015). The analysis of
quantitative practices in the field is currently an active area of research, as ‘there is no
controversy over the necessity of rigorous quantitative methods to advance the field of
SLA' (Plonsky, 2013, p. 656).
In addition to the goal of improving quantitative quality, there has also been an inter-
est in overall journal quality, both within the L2 field and in academia in general (see
Egbert, 2007; Garfield, 2006; Vitta & Al-Hoorie, 2017). Indexing has often been seen as
a key factor in journal evaluation, with Scopus and the Web of Science being the two
indexes of prestige (Chadegani et al., 2013; Guz & Rushchitsky, 2009). Within these two
catalogues, citation analysis metrics have been employed as efficient measurements of
overall journal quality for some time. Garfield (2006), for example, noted that the Web
of Science’s Impact Factor has been in use since the 1950s as the index’s citation analysis
metric. At the same time, indexing and citation analysis have not universally been
accepted as a definitive way to assess journal quality within a field (e.g. Brumback,
2009; Egbert, 2007). This is primarily due to the notion that citation quantifies only one
aspect in the overall evaluation of a journal and might miss other, perhaps equally impor-
tant, aspects of journal quality.
Based on similar considerations, Plonsky (2013) has suggested the need for investi-
gating the relationship between statistical practices and journal quality in the L2 field. To
date, such an investigation does not seem to have been performed yet. The current study
therefore aimed to address this gap. A total of 150 quantitative articles from 30 L2 jour-
nals were assessed for quantitative rigor. The relationship between each journal’s quan-
titative quality and a number of popular journal quality measurements relating to citation
analysis and indexing were then examined.
II Overview
1 Quantitative rigor
In recognition of the importance of quantitative knowledge in the second language (L2)
field, a number of researchers have recently investigated various aspects related to meth-
odological and statistical competence of L2 researchers. Loewen and colleagues (2014),
for instance, found that only 30% of professors in the field report satisfaction with their
level of statistical training, while only 14% of doctoral students do so. Loewen and col-
leagues (2017) subsequently extended this line of research by using independent meas-
ures of quantitative competence, rather than simply relying on self-reported knowledge,
and also found significant gaps in the statistical literacy of L2 researchers. Furthermore,
quality of research design does not seem a priority for some scholars in the field when
evaluating the status of different journals (Egbert, 2007).
Inadequate quantitative knowledge is likely going to be evident in the field’s publica-
tions. In one of the first empirical assessments of the methodological quality in published
L2 research, Plonsky and Gass (2011) investigated quantitative reports in the L2 interac-
tion tradition spanning a period of around 30 years. They observed that ‘weaknesses in
the aggregate findings appear to outnumber the strengths’ (p. 349) and speculated
whether this is in part due to inadequate training of researchers. Subsequently, Plonsky
(2013, 2014) investigated articles published in two major journals in the field (Language
Learning and Studies in Second Language Acquisition). In line with previous findings,
these studies identified a number of limitations prevalent in the field, such as lack of
control in experimental designs and incomplete reporting of results. In a more recent
study, Lindstromberg (2016) examined articles published in one journal (Language
Teaching Research) over a period of about 20 years. This study also found issues similar
to those obtained in previous studies, such as incomplete reporting and overemphasizing
significance testing over effect sizes.
2 Unpacking the ‘sins’ of quantitative research
A large number of topics fall under quantitative research, making a parsimonious clas-
sification of these topics no easy task (see Stevens, 2009; Tabachnick & Fidell, 2013). In
this article, we adopt an approach similar to that used by Brown (2004), where statistical
topics are classified into three broad areas: psychometrics, inferential testing, and
assumption checking.
Psychometrics – subsuming reliability and validity – has been of major concern to
past research into the quantitative rigor in the field. Larson-Hall and Plonsky (2015)
pointed out that the most commonly used reliability coefficients in the field are
Cronbach’s α (for internal consistency) and Cohen’s κ (interrater reliability). Norris
et al. (2015) echoed the call for researchers to address reliability while also arguing
that researchers provide ‘validity evidence’ (p. 472). Validity evidence is of equal
importance to reliability, though validity is trickier since it lacks a straightforward
numeric value that researchers can report. For this reason, Norris et al. (2015) sug-
gested the use of conceptual validity evidence, such as pilot studies and linking instru-
ments to past research. Empirical validity evidence can also be utilized, such as factor
analysis for construct validity (Tabachnick & Fidell, 2013). In L2 research, reliability
has usually been emphasized, while validity considerations have been somewhat over-
looked. For example, studies using the Strategy Inventory for Language Learning
(Oxford, 1990) often reported internal reliability (see Hong-Nam & Leavell, 2006;
Radwan, 2011), despite calls for the need to also consider validity evidence for this
instrument (Tseng, Dörnyei, & Schmitt, 2006).
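To make the reliability side of this concrete, the following minimal sketch computes Cronbach's α from an item-by-respondent matrix of invented Likert responses; it assumes NumPy and is an illustration only, not a full psychometric analysis (and, as noted later in this article, α also presupposes a unidimensional scale).

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 6 learners x 4 Likert items on one scale.
responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [3, 2, 3, 3],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```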
When it comes to inferential testing, a number of issues have been pointed out in
previous methodological research. One of these issues is, obviously, the need to use
inferential statistics. Descriptive statistics alone can sometimes be informative (Larson-
Hall, 2017), but for most purposes it is not clear whether an observed trend is merely a
natural fluctuation that should not be overinterpreted. Inferential statistics, by definition,
permits the researcher to generalize the observed trend from the sample at hand to the
population. In their investigation of interaction research, Plonsky and Gass (2011) report
that around 6% of studies did not perform any inferential tests. Similarly, Plonsky (2013) reports in
a subsequent study that 12% of the studies in the sample did not use inferential testing.
A second issue concerning inferential testing has to do with complete reporting of the
results (Larson-Hall & Plonsky, 2015). A number of methodologists (e.g. Brown, 2004;
Nassaji, 2012; Norris et al., 2015; Plonsky, 2013; 2014, among others) have emphasized
that inferential tests must have all their relevant information presented for transparency
and replicability. In the case of t-tests, for example, readers would need information
regarding means, standard deviations, degrees of freedom, t-value, and p-value.
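As an illustration of such complete reporting, the sketch below runs an independent-samples t-test on invented scores and prints the descriptives, degrees of freedom, t-value, and p-value together; SciPy is assumed, and the numbers are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for two groups (e.g. treatment vs. comparison).
group_a = np.array([72, 78, 69, 81, 75, 70, 77, 74])
group_b = np.array([68, 71, 66, 74, 70, 65, 69, 72])

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # Student's t, equal variances assumed
df = len(group_a) + len(group_b) - 2                  # degrees of freedom for this test

print(f"Group A: M = {group_a.mean():.2f}, SD = {group_a.std(ddof=1):.2f}")
print(f"Group B: M = {group_b.mean():.2f}, SD = {group_b.std(ddof=1):.2f}")
print(f"t({df}) = {t_stat:.2f}, p = {p_value:.3f}")
```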
Effect sizes, which describe the magnitude of the relationship among variables, are a
further required aspect of quantitative rigor. In arguing for the centrality of effect sizes,
Norris (2015) asserted that L2 researchers have a tendency to conflate statistical signifi-
cance with practical significance. In the same vein, Nassaji (2012) posited that our field
has yet to firmly grasp that p-values only speak to Type I error probability and not the
strength of association between dependent and independent variables (see Cohen, Cohen,
West, & Aiken, 2003). Effect size reporting has therefore been stressed by L2 method-
ologists in recent years (e.g. Larson-Hall & Plonsky, 2015; Plonsky, 2013, 2014; Plonsky
& Gass, 2011; Plonsky & Oswald, 2014).
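A minimal sketch of one widely reported effect size, Cohen's d for two independent groups, computed from summary statistics; the values are invented and serve only to show the arithmetic.

```python
import math

def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Cohen's d for two independent groups using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical summary statistics for two groups.
d = cohens_d(mean1=74.5, sd1=4.1, n1=8, mean2=69.4, sd2=3.0, n2=8)
print(f"Cohen's d = {d:.2f}")  # group difference expressed in pooled-SD units
```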
In addition to the above, a common situation in inferential testing is when researchers
perform several tests. Brown (1990) suggests that researchers should employ ANOVA as
an alternative to multiple t-tests in order to control Type I error rate. Norris and colleagues
(Norris, 2015; Norris et al., 2015) also highlight the need to correct the alpha level for
multiple comparisons. A common procedure is the Bonferroni correction, where the
alpha level is divided by the number of tests performed. As an illustration, if 10 t-tests are
performed simultaneously in one study, the alpha level becomes .05/10 = .005. In this
example, a result is considered significant only if the p-value is less than .005. Procedures
that are less conservative than the Bonferroni correction have also been proposed (e.g.
Holm–Bonferroni and Benjamini–Hochberg; see Larson-Hall, 2016; Ludbrook, 1998).
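The sketch below applies the Bonferroni and Holm–Bonferroni adjustments to ten hypothetical p-values; dedicated routines (e.g. in statsmodels) also implement these and the Benjamini–Hochberg procedure, but the logic is simple enough to show directly.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 only when p < alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

def holm_bonferroni(p_values, alpha=0.05):
    """Step-down Holm procedure: less conservative than plain Bonferroni."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices sorted by p-value
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (m - rank):
            reject[i] = True
        else:
            break                                          # stop at the first non-rejection
    return reject

# Ten hypothetical p-values from ten t-tests in one study.
pvals = [0.001, 0.004, 0.012, 0.020, 0.030, 0.041, 0.049, 0.060, 0.150, 0.300]
print(bonferroni(pvals))        # only p-values below .05/10 = .005 survive
print(holm_bonferroni(pvals))   # Holm can reject more in general; here it matches Bonferroni
```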
Finally, an essential consideration in quantitative research is checking that the necessary
assumptions are satisfied. Although it is typically classified under inferential statistics (i.e.
to determine whether parametric tests are appropriate), this point is placed in a separate
category in the present article for two reasons. First, checking assumptions is not limited to
inferential testing but also applies to descriptive statistics. For example, reporting the mean
and standard deviation assumes that the data are normally distributed. Otherwise, the mean
and standard deviation would not be representative of the central tendency and dispersion
of the data, respectively. Psychometrics also have assumptions that need to be satisfied. For
example, Cronbach’s alpha assumes that the construct is unidimensional (Green, Lissitz, &
Mulaik, 1977), or else its value could be inflated. Second, assumption checking seems
consistently overlooked in L2 research, despite repeated calls emphasizing its importance
(e.g. Lindstromberg, 2016; Loewen & Gass, 2009; Nassaji, 2012; Norris, 2015; Norris
et al., 2015; Plonsky, 2014; Plonsky & Gass, 2011). In the present study, the violations
reviewed above are called the seven ‘sins’ of quantitative research (see Table 1).
III Journal quality
Discussion of the methodological rigor of research articles ultimately speaks to the qual-
ity of the field’s journals. Assessment of journal quality is of particular importance
because of the value different stakeholders in both academic and professional arenas
place on research found in them (Weiner, 2001). Egbert (2007) surveyed multiple ways
to gauge journal quality in the L2 field, such as citation analysis, rejection rate, time to
publication, and expert opinion.
In reality, however, citation analysis has been one of the most common means of
evaluating different journals, probably because it offers a simple numeric impact value
to rank each journal (see Brumback, 2009; Colledge et al., 2010; Leydesdorff & Opthof,
2010). The history of citation analysis metrics dates back to the 1950s (see Garfield,
2006). Academia, in general, has been eager to embrace a means to gauge journal quality
empirically (Weiner, 2001). Currently, the most commonly used citation analysis metrics
are Source Normalized Impact per Paper (SNIP), SCImago Journal Rank (SJR), and
Journal Citation Reports (JCR) Impact Factor. The former two are maintained by
Elsevier’s Scopus, while JCR is maintained by Clarivate’s Web of Science (WOS; for-
merly Thomson Reuters). WOS also maintains the Social Sciences Citation Index
(SSCI), which is the most relevant to our field.
Intense competition exists between these two indexing services, resulting in continuous
improvement of their metrics (see Chadegani et al., 2013; Guz & Rushchitsky, 2009). As
part of this development, Scopus has recently unveiled a new metric called CiteScore (see
da Silva & Memon, 2017), which is calculated in a similar way to JCR except that the look-
back period is three years rather than two. Table 2 presents an overview of these metrics.
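To make the look-back arithmetic concrete, here is a minimal sketch using invented citation counts; real CiteScore and JCR values also depend on which document types count as citable, so the example is illustrative only.

```python
def ratio_metric(citations_in_year: int, citable_items: int) -> float:
    """Citations received in a given year divided by citable items in the look-back window."""
    return citations_in_year / citable_items

# Hypothetical counts for a single journal, for illustration only.
citations_2015 = 240      # citations received in 2015
items_two_years = 100     # citable items published in the previous two years (JCR-style)
items_three_years = 160   # citable items published in the previous three years (CiteScore-style)

jcr_style = ratio_metric(citations_2015, items_two_years)
citescore_style = ratio_metric(citations_2015, items_three_years)
print(f"JCR-style: {jcr_style:.2f}, CiteScore-style: {citescore_style:.2f}")
```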
Metrics judging journal quality through citation analysis have come under heavy criti-
cism (Brumback, 2009). Some have expressed doubt about the viability of using one
metric to assess the various dimensions contributing to a journal’s quality (Egbert, 2007).
Some of these metrics are proprietary and their actual calculations are not made public,
making them unverifiable. In fact, there have been reports of editors and publishers
negotiating and manipulating these metrics to improve journal rankings (see Brumback,
2009). Nevertheless, citation analysis metrics remain the primary means of assessing
journal quality, thus governing employment, tenure, funding, and livelihood for many
researchers around the world.
In the L2 field, there have been attempts to evaluate the different journals available.
In one study, Jung (2004) aimed to rank the prestige of a number of journals. Jung’s
primary criteria were being teacher-oriented and being indexed by the prestigious Centre
for Information on Language Teaching and Research, then hosted in the journal Language Teaching (n = 12). In another study, Egbert (2007) created a list of field-specific journals primarily based on expert opinion (n = 35). Benson, Chik, Gao, Huang, and Wang (2009) also employed a list of journals (n = 10) for the purpose of evaluating these journals in relation to qualitative rigor. In all of these studies, journals' citation analysis values have not been systematically taken into consideration in evaluating these journals.

Table 1. The seven sins of quantitative research.

Area | Violation
Psychometrics | 1. Not reporting reliability
Psychometrics | 2. Not discussing validity
Inferential statistics | 3. Making inferences from descriptive statistics
Inferential statistics | 4. Incomplete reporting, including non-significant results
Inferential statistics | 5. Not reporting effect sizes
Inferential statistics | 6. Not adjusting for multiple comparisons
Other | 7. Not reporting assumption checks
IV The present study
As reviewed above, previous research on quantitative quality in the L2 field has tended
to focus on specific publication venues (e.g. one or two particular journals) or specific
research traditions (e.g. L2 interaction). Studies with a wider scope, on the other hand,
have not focused on either quantitative quality or its relation to popular journal citation
analysis. In this study, we aimed to conduct a broader review covering a representative
sample of L2 journals. This would allow us to obtain a more general overview of study
quality in the L2 field, as well as to empirically assess the utility of the four citation
analysis metrics as a representation of quality of different journals.
Because of this wide coverage, we also narrowed our scope to focus specifically on
statistical issues rather than more general methodological issues. For example, prior
research has repeatedly shown that many L2 researchers usually overlook important con-
siderations such as power analysis and using a control group (e.g. Lindstromberg, 2016;
Plonsky, 2013, 2014). However, it must be noted that such considerations are design
issues that need to be addressed before conducting the study. Thus, having an adequate
sample size or a control group can sometimes be governed by practical and logistical
considerations that are outside the researcher’s control. Because of the wide coverage of
our sample of journals, involving various L2 sub-disciplines where practical limitations
may be an essential part of everyday research, we limited our review to data analysis issues specifically.

Table 2. Four common journal citation analysis metrics and their characteristics.

Impact factor | Indexing service | Description
SNIP | Scopus | Number of citations to a journal's articles in the past three years divided by the total number of its articles in the past three years. Normalized to facilitate cross-discipline comparisons (Colledge et al., 2010).
SJR | Scopus | Essentially SNIP calculations that were additionally weighted, depending on the rank of the citing journal, while excluding self-citations. The weighting uses the PageRank algorithm (Guerrero-Bote & Moya-Anegón, 2012).
CiteScore | Scopus | Total number of a journal's citations in a given year divided by the journal's total number of citable publications during the past three years (da Silva & Memon, 2017).
JCR | WOS | Total number of a journal's citations in a given year divided by the journal's total number of citable publications during the past two years (Garfield, 2006).
Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; WOS = Web of Science.

We aimed to answer three research questions:
1. What are the most common statistical violations found in L2 journals?
2. What is the relationship between the journal’s statistical quality and its citation
analysis scores (SNIP, SJR, CiteScore, and JCR)?
3. What is the relationship between the journal’s statistical quality and its indexing
(SSCI vs. Scopus)?
V Method
1 Inclusion criteria
In order to be included in this study, the journal had to satisfy the following criteria:
1. The journal is indexed by Scopus or SSCI.
2. The journal is within the second/foreign language learning and teaching area.
3. The journal presents original, quantitative research.
4. The journal uses English as its primary medium.
2 Journal search
The titles of journals indexed by Scopus and SSCI were examined against a list of
keywords representing various interests in the L2 field. Three experts were consulted
over two iterations to develop and validate the list of keywords (for the complete list,
see Appendix 1). A few well-known journals were not captured by our keywords (e.g.
System) and these were subsequently added. The final list of journals satisfying all
inclusion criteria included 30 journals (for the complete list, see Appendix 2). All jour-
nals were indexed in Scopus but only 19 were additionally indexed in SSCI. Our sam-
ple of 30 journals was larger than previous samples by Jung (2004, n = 12) and by
Benson et al. (2009, n = 10). It was slightly smaller than that by Egbert (2007, n = 35),
but this was primarily because our sample was limited to journals indexed in either
Scopus or SSCI. Therefore, it seems reasonable to argue that our sample is representa-
tive of journals in the L2 field.
Table 3 presents an overview of the journals in our sample and their citation analysis
scores. Both the Kolmogorov–Smirnov and Shapiro–Wilk tests for normality showed
that the four impact factors were normally distributed, ps > .05.
3 Data analysis
Five recent quantitative articles were randomly extracted from each of the 30 journals,
resulting in a total of 150 articles. All articles were published in 2016 or later, thus mak-
ing them representative of the latest quantitative trends in these journals. While a five-
article sample may not be fully representative of trends in a journal over time, our aim
was to investigate quantitative practices found in the most recent publications in the L2
field. This would allow us to find out the most common statistical violations in recent
literature. We discuss this issue further in the limitations of this study.
Each article was subsequently reviewed by two researchers independently against the
violations of statistical conventions described in Table 1. Controversial topics were
avoided, such as the adequacy of classical test theory vs. item response theory, or explor-
atory factor analysis vs. principal components analysis. In coding violations in each arti-
cle, repeated violations of the same issue were coded as one violation only, so violations
were coded in a binary format (present vs. not present) for each category. Interrater reli-
ability was high (96.7% agreement, κ = .91), and all discrepancies were resolved through
discussion until 100% agreement was reached.
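For readers wishing to run a comparable coding check, the sketch below computes percentage agreement and Cohen's κ for two raters' binary codes; the codes shown are invented, and libraries such as scikit-learn offer an equivalent cohen_kappa_score function.

```python
def percent_agreement(r1, r2):
    """Proportion of items on which the two raters gave the same code."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters and binary (0/1) codes."""
    n = len(r1)
    p_observed = percent_agreement(r1, r2)
    # Expected chance agreement from each rater's marginal proportions.
    p1_yes = sum(r1) / n
    p2_yes = sum(r2) / n
    p_expected = p1_yes * p2_yes + (1 - p1_yes) * (1 - p2_yes)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical binary codes (violation present = 1) from two raters.
rater1 = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
rater2 = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(f"Agreement = {percent_agreement(rater1, rater2):.1%}, kappa = {cohens_kappa(rater1, rater2):.2f}")
```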
VI Results
Violations were first averaged within each journal, and then the overall mean and stand-
ard deviation were computed (see Table 4). The table also reports the one-sample t-test
that examined whether the observed means were significantly different from zero. All
results were significant, with p-values below the Bonferroni-adjusted significance level (.05/7
= .007). The effect sizes were also generally substantial.
Since the maximum possible mean was 1.0 for each violation following our binary
coding, the means in the table can also be interpreted as a probability. As an illustration,
the probability of an article having issues with reliability is 24.7% (see Table 4). In other
words, almost one in every four articles would have a reliability issue. On the other hand,
almost every other article makes inferences from descriptive statistics (44%). Similarly,
just over one in three articles does not report effect sizes (38.7%).
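The following sketch shows the kind of one-sample test summarized in Table 4, applied to invented per-journal violation rates (these are not the study's data); SciPy is assumed.

```python
import numpy as np
from scipy import stats

# Hypothetical per-journal violation rates for one category
# (proportion of sampled articles showing the violation), one value per journal.
rates = np.array([0.2, 0.4, 0.0, 0.2, 0.4, 0.6, 0.2, 0.0, 0.4, 0.2,
                  0.6, 0.2, 0.4, 0.2, 0.0, 0.4, 0.2, 0.6, 0.2, 0.4])

t_stat, p_value = stats.ttest_1samp(rates, popmean=0.0)   # H0: the mean rate is zero
d = rates.mean() / rates.std(ddof=1)                      # one-sample Cohen's d
print(f"M = {rates.mean():.3f}, t({len(rates) - 1}) = {t_stat:.2f}, p = {p_value:.4f}, d = {d:.2f}")
```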
Since the normality and linearity assumptions were satisfied, the correlations between
journals’ statistical quality and their citation analysis scores were examined to shed further
light on these results. There was a positive correlation between the statistical quality (vio-
lations were coded here so that a higher value indicated higher quality) of the 30 journals
and their Scopus citation analysis metrics (SJR: r = .414, p = .023; SNIP: r = .339, p = .067;
CiteScore: r = .344, p = .062). In other words, SJR accounted for 17.1% of the variance
observed in journals’ statistical quality while SNIP and CiteScore accounted for 11.5%
and 11.2% of the observed variance, respectively. JCR impact factor was non-significant
for the 19 journals indexed by SSCI, r = –.129, p = .599, accounting for a negligible 1.7%
of the variance.
Table 3. Number of journals included and means and standard deviations of their 2015 citation analysis scores.

Index | n | Impact factor | M (SD)
Scopus | 30 | SNIP | 1.24 (0.78)
Scopus | 30 | SJR | 0.98 (0.77)
Scopus | 30 | CiteScore | 1.17 (0.84)
SSCI | 19 | JCR | 1.42 (0.60)
Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; SSCI = Social Sciences Citation Index.
Although the correlation between JCR and journal statistical quality was non-signifi-
cant, an independent samples t-test showed that non-SSCI-indexed journals had signifi-
cantly more violations (M = 11.27, SD = 3.35, n = 11) than SSCI-indexed ones (M = 7.89,
SD = 3.59, n = 19), t(28) = 2.54, p = .017, d = 0.97. These results suggest that SSCI-
indexed journals demonstrate higher quantitative rigor. Table 5 lists L2 journals indexed
by both Scopus and SSCI that demonstrated the fewest violations (see Note 1).
VII Discussion
This article has presented the results of an analysis of 150 articles derived from a list of 30
journals representative of the L2 field. A number of statistical violations were observed
with varying degrees of frequency. The results also showed that Scopus citation analysis
metrics represent moderate predictors of statistical quality (accounting for around 11–
17% of the variance), with no evidence to favor the newly introduced CiteScore over
SNIP or SJR. Although these metrics account for less than 20% of the variance in the
observed quality of L2 journals, this magnitude might be considered reasonable, consider-
ing that statistical quality is only one dimension factoring into the overall quality of a jour-
nal. Other dimensions include non-statistical components of quantitative articles as well
as other types of articles such as qualitative and conceptual ones. Indeed, ‘a single method,
regardless of the number of components included, could not account for important differ-
ences among journals and in reasons for publishing in them’ (Egbert, 2007, p. 157).
The results also show that JCR was not a significant predictor of journal statistical
quality. This finding does not necessarily imply that this metric is not useful. It may mean
that L2 journals indexed by SSCI do not demonstrate sufficient variation for JCR to cap-
ture it. To become indexed by SSCI is typically a long process in which journals are
expected to demonstrate a certain level of quality. Indeed, the results above showed that
journals indexed in SSCI exhibited fewer statistical violations than non-SSCI journals.
Overall, these results suggest two ways to evaluate L2 journal quality: 1) the journal’s SJR
value and 2) whether the journal is SSCI-indexed.

Table 4. Prevalence of the seven violations emerging from the analysis.

Theme | M | SD | t | p | d
Psychometrics:
1. Not reporting reliability | 0.247 | 0.23 | 5.81 | < .0001 | 1.06
2. Not discussing validity | 0.087 | 0.13 | 3.81 | .00066 | 0.70
Inferential statistics:
3. Making inferences from descriptive statistics | 0.440 | 0.29 | 8.20 | < .0001 | 1.50
4. Incomplete reporting, including non-significant results | 0.253 | 0.23 | 5.92 | < .0001 | 1.08
5. Not reporting effect sizes | 0.387 | 0.26 | 8.25 | < .0001 | 1.51
6. Not adjusting for multiple comparisons | 0.140 | 0.20 | 3.87 | .00056 | 0.71
Other:
7. Not reporting assumption checks | 0.253 | 0.25 | 5.50 | < .0001 | 1.00

The remainder of this article offers a
brief overview of the most common violations emerging from the present analysis.
1 Psychometrics
It is important for researchers to report details on the reliability and validity of their
instruments. In our analysis, a number of articles did not report the reliability of their
instruments, particularly dependent variables. In situations where manual coding is
involved, it is also important to examine and report interrater reliability. This
is especially important when a large amount of data is coded and subjectivity may be
involved.
When multiple scales are used (e.g. as part of a questionnaire), it is also important to
examine the factorial structure of the scales, whether using a procedure from classical
test theory or item response theory. It is common for researchers to adapt existing scales
for their own purposes but without conducting a factor analytic procedure to examine
convergent and discriminant validity of these scales. In some cases, even the original
developers of these scales did not investigate these issues. Some may argue that such
scales have been established in previous research. However, it would seem arbitrary to
require reporting reliability every time a scale is administered, but assume that other
psychometric characteristics can simply be imported from prior research. Reliability also
gives a very limited insight into the psychometric properties of a scale. Green et al.
(1977) offer examples of high reliability (e.g. over .80) that is a mere artifact of a long,
multidimensional scale, while Schmitt (1996) showed that low reliability (e.g. under .50)
is not necessarily problematic (see also Sijtsma, 2009, for a more detailed critique).
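Returning to the factorial-structure point above, the sketch below fits an unrotated two-factor model to simulated responses using scikit-learn's FactorAnalysis and prints the loadings for inspection of cross-loadings; it is a toy illustration, and dedicated EFA/CFA tools with rotation would normally be preferred.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Hypothetical item responses: 100 learners x 6 items intended to form two 3-item subscales.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 2))
items = np.hstack([
    latent[:, [0]] + rng.normal(scale=0.7, size=(100, 3)),   # subscale A items
    latent[:, [1]] + rng.normal(scale=0.7, size=(100, 3)),   # subscale B items
])

fa = FactorAnalysis(n_components=2)
fa.fit(items)
print(np.round(fa.components_.T, 2))   # item-by-factor loadings; inspect for cross-loadings
```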
A discussion of validity is equally important. In our sample, a common situation
where a discussion of validity was lacking was in the use of authentic variables, such as
school grades. When the researcher draws from such authentic variables, it is typically
out of the researcher's hands to control for their reliability and validity. In such cases, readers would at least need a detailed description of the variable, its characteristics, and the circumstances surrounding its measurement in order to evaluate its adequacy for the purpose of the article, such as whether grades are a fair reflection of proficiency in the context in question. This information may also be helpful in resolving inconsistent results when they emerge from different studies.

Table 5. Journals demonstrating highest quality in the present sample and their 2015 citation analysis scores.

Journal | SNIP (Scopus) | SJR (Scopus) | CiteScore (Scopus) | JCR (SSCI)
Computer Assisted Language Learning | 1.54 | 1.26 | 1.64 | 1.72
English for Specific Purposes | 2.73 | 1.66 | 2.11 | 1.14
International Review of Applied Linguistics in Language Teaching | 0.97 | 0.91 | 0.95 | 0.80
Language Assessment Quarterly | 0.63 | 1.07 | 0.93 | 0.98
Language Learning | 2.54 | 2.47 | 2.58 | 1.87
Language Testing | 1.36 | 1.44 | 1.50 | 0.91
Modern Language Journal | 1.13 | 1.15 | 1.54 | 1.19
Studies in Second Language Acquisition | 1.41 | 2.49 | 1.99 | 2.23
TESOL Quarterly | 1.43 | 1.46 | 1.58 | 1.51
Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; SSCI = Social Sciences Citation Index.
When researchers develop their own instruments, extra work is required. Instrument
development should be treated as a crucial stage in a research project. Researchers devel-
oping a new instrument should perform adequate piloting to improve the psychometric
properties of the instrument before the actual study starts in order to convince the reader
of its outcomes. Poor instruments risk misleading results.
2 Inferential statistics
One of the most common issues arising in our analysis was the tendency to make infer-
ences from descriptive statistics. It is important to be aware of the distinction between
descriptive statistics and inferential statistics. Descriptive statistics refer to the character-
istics of the sample in hand. These characteristics could potentially be idiosyncratic to
this specific sample and not generalizable to the population it was sampled from.
Inferential statistics help decide whether these characteristics are generalizable to the
population, primarily through a trade-off between the magnitude of the descriptive sta-
tistic (e.g. mean difference between two groups) and the size of the sample.
Descriptive statistics alone may be useful in describing some general trends. However,
in most cases, without inferential statistics it may not be clear whether the pattern
observed is genuine or resulting from chance. This applies to all descriptive statistics,
such as means, standard deviations, percentages, frequencies, and counts. Researchers
reporting any of these statistics should consider an appropriate inferential test before
making generalizations to the population. In certain situations, it might at first seem hard
to think of an appropriate inferential test, but it is the researcher’s responsibility to dem-
onstrate to the reader that the results are generalizable to the population. Ideally, the
decision on which test to use should be made at the design stage before conducting the
study (e.g. preregistration).
In our sample, we found three common situations where inferences were frequently
made without inferential statistics. The first was when only one sample was involved. In
such situations, the researcher might consider a one-sample test to tell whether the statis-
tic is significantly different from zero (as was done in the present study). In some cases,
this might be a mundane procedure, but it is a first step toward calculating the size of the
effect (see below), which is typically a more interesting question. The second situation
arising from our analysis had to do with count data. A number of articles reported counts
of certain phenomena (e.g. number of certain linguistic features in a corpus), and then
made inferences based on those counts. In such cases, the researcher might consider the
chi-square test for independent groups and the McNemar test for paired groups. Rayson,
Berridge, and Francis (2004) have also suggested the log likelihood test for comparing
observed corpus frequencies. The third, more subtle, situation was when researchers
compared two test statistics, such as two correlation coefficients. In these situations, the
two coefficients might be different but the question is whether this difference is itself
large enough to be statistically significant. In fact, even if one coefficient were signifi-
cant and the other were not, this would not be sufficient to conclude that the difference
between them would also be significant (see Gelman & Stern, 2006). In this case, Fisher’s
r-to-z transformation could be used.
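For the third situation, a minimal sketch of Fisher's r-to-z comparison of two independent correlations; the coefficients and sample sizes are hypothetical, and the two-tailed p-value is obtained from the standard normal distribution.

```python
import math

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed test of the difference between two independent correlations."""
    z1, z2 = math.atanh(r1), math.atanh(r2)         # Fisher r-to-z transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))     # standard error of the difference
    z = (z1 - z2) / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))   # 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical: r = .45 (n = 60) in one group vs. r = .20 (n = 55) in another.
z, p = compare_correlations(0.45, 60, 0.20, 55)
print(f"z = {z:.2f}, p = {p:.3f}")
```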
Another issue arising from our results is incomplete reporting of results, including
non-significant ones. A number of articles presented their results in detail if they were
significant, but the non-significant results were abridged and presented only in passing.
Regardless of the outcome of the significance test, the results should be reported in full,
including descriptive statistics, test statistics, degrees of freedom (where relevant), p-val-
ues, and effect sizes. Failing to report non-significant results can lead to publication bias,
in which only significant results become available to the research community (Rothstein,
Sutton, & Borenstein, 2005). Failing to report non-significant results in full may also
preclude the report from inclusion in future meta-analyses.
In our analysis, we did not consider it a violation if confidence intervals were not
reported. Although there have been recent calls in the L2 field to report confidence inter-
vals to address some limitations of significance testing (e.g. Lindstromberg, 2016; Nassaji,
2012; Norris, 2015), this issue is less straightforward than it might seem at first. This is due
to the controversy surrounding the interpretation of confidence intervals. Since they were
first introduced (Neyman, 1937), confidence intervals have never been intended to repre-
sent either the uncertainty around the result, its precision, or its likely values. Morey,
Hoekstra, Rouder, Lee, and Wagenmakers (2016) refer to such interpretations as the fallacy
of placing confidence in confidence intervals, which is prevalent among students and
researchers alike (Hoekstra, Morey, Rouder, & Wagenmakers, 2014). As a matter of fact,
a confidence interval for a parameter refers to a procedure that, in repeated sampling,
produces intervals containing that parameter a fixed proportion (e.g. 95%) of the time. Confidence intervals
therefore concern the probability in the long run, and may not be related to the results from
the original study. Some statisticians have even gone as far as to describe confidence inter-
vals as ‘scientifically useless’ (Bolstad, 2007, p. 228). Whether the reader would agree with
this evaluation or consider it rather extreme, our aim is to point out that confidence inter-
vals are far from the unanimously accepted panacea for significance testing ills.
In contrast to confidence intervals, there is far more agreement among methodologists
on the need for reporting effect sizes to complement significance testing. The discussion
so far has mentioned significance and significance testing several times, probably giving
the impression of the meaningfulness of this procedure. Actually, a significant result is a
relatively trivial outcome. Because the null hypothesis is always false (Cohen, 1994),
just increase your sample size if you want a significant result! A p-value is the probability
of obtaining data at least as extreme as those observed, given the null hypothesis. When the
result is significant at the .05 level, we can conclude that the probability of obtaining such
an outcome by chance is at most 5% if the
null hypothesis is true. It does not mean that the effect is big or strong. At the same time,
a non-significant result does not represent evidence against the hypothesis, as it is pos-
sible that the study was underpowered. To obtain evidence against a hypothesis, the
researcher needs to calculate the power of the test and then the effect size to demonstrate
that there is no ‘non-negligible’ effect (see Cohen et al., 2003; Lakens, 2013), or alterna-
tively use Bayes factors (see Dienes, 2014).
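As a sketch of the power calculation mentioned here, the example below asks how many participants per group an independent-samples t-test would need to detect a medium effect (d = 0.5) with 80% power, and then reverses the question for a fixed sample size; it assumes the statsmodels package and is illustrative only.

```python
from statsmodels.stats.power import TTestIndPower

# Assumed inputs: smallest effect size of interest, alpha, and target power.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Approximately {n_per_group:.0f} participants per group are needed.")

# The same routine can be run in reverse to estimate achieved power
# for a completed study with a known per-group sample size.
achieved_power = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=30)
print(f"Power with n = 30 per group: {achieved_power:.2f}")
```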
The final point discussed in this section is not adjusting for multiple comparisons. As
mentioned above, a p-value gives the probability of obtaining data at least as extreme as those observed if the null is true.
It therefore does not refer to the probability of the hypothesis itself. With multiple com-
parisons, the likelihood of obtaining a significant result by mere chance no longer
remains at 5%, thus raising the risk of Type I error. One way to address this problem is
to implement an appropriate correction procedure (Larson-Hall, 2016; Ludbrook, 1998).
Another approach is to determine the specific tests to be conducted beforehand. Any
other analyses conducted should then be labeled explicitly as exploratory, since their
results could potentially reflect Type I error. Perhaps the worst a researcher could do in
this regard is to conduct various tests and report only those that turn out significant.
3 Other issues
In our analysis, a number of issues emerged related to not reporting assumption checks
before conducting particular procedures. This was placed in a separate category because
it is used here in a broad sense to refer to both descriptive and inferential statistics, as
well as psychometrics. For example, some articles used nonparametric tests due to non-
normality of the data but also reported the mean and standard deviation to describe their
data. Using the mean and standard deviation assumes that the data are normal. Many
articles also used inferential tests that require certain assumptions, such as normality and
linearity, but without assuring the reader that these assumptions were satisfied. Other
articles that performed factor analysis did not report factor loadings fully or address the
implications of cross-loadings. In many cases, such concerns can be addressed by simply
making the dataset publicly available.
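A minimal sketch of the kind of check discussed here: testing normality with Shapiro–Wilk and, when the assumption looks doubtful, reporting the median and interquartile range rather than (or alongside) the mean and standard deviation. The scores are invented and SciPy/NumPy are assumed.

```python
import numpy as np
from scipy import stats

# Hypothetical, clearly skewed scores (e.g. reaction times in ms).
scores = np.array([420, 450, 455, 470, 480, 495, 510, 530, 610, 950, 1200])

w_stat, p_value = stats.shapiro(scores)     # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk: W = {w_stat:.2f}, p = {p_value:.3f}")

if p_value < .05:
    # Normality doubtful: the median and IQR describe the data more faithfully.
    q1, q3 = np.percentile(scores, [25, 75])
    print(f"Median = {np.median(scores):.0f}, IQR = {q3 - q1:.0f}")
else:
    print(f"M = {scores.mean():.1f}, SD = {scores.std(ddof=1):.1f}")
```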
A particularly overlooked assumption is the independence of errors: many statistical procedures
assume that errors are uncorrelated. For example, when learners come from distinct
classes, learners from the same class will be more similar to each other than those from
different classes. When this happens, the observations are no longer independent. As a consequence,
Type I error rate increases, such as when learners from one class in the group have higher
scores because of some unique feature of that class. In this case, the overall group mean
will be inflated because of one class only. The effect of violating this independence
assumption might be mild when there are only a few classes. But with more classes (e.g.
over 20), the effect could be more serious. One approach to deal with this situation is to
use classes as the unit of analysis by averaging the values for learners within each class.
Another approach is to use multilevel and mixed-effects modeling to model both higher
and lower units simultaneously (Hox, 2010).
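A sketch of the second approach, assuming the statsmodels package and a hypothetical data frame with score, condition, and class_id columns; the random intercept for class absorbs the class-level similarity that would otherwise inflate Type I error. With such a tiny illustrative dataset the fit may emit convergence warnings.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per learner, learners nested in classes.
df = pd.DataFrame({
    "score":     [62, 65, 61, 70, 72, 69, 55, 58, 54, 66, 64, 67],
    "condition": ["T", "T", "T", "T", "T", "T", "C", "C", "C", "C", "C", "C"],
    "class_id":  [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
})

# Random intercept per class; fixed effect of condition.
model = smf.mixedlm("score ~ condition", data=df, groups=df["class_id"])
result = model.fit()
print(result.summary())
```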
VIII Conclusions
This article has presented a review of 30 journals representative of the L2 field. The review
focused on statistical issues specifically, rather than methodological issues. It was not
intended to downplay the importance of methodological issues, such as an adequate sample
size based on a priori power calculation or including a control group in (quasi)experimental
designs. Instead, we focused on statistical issues only because they seemed relevant to a
broader section of the field, including areas fraught with practical constraints.
The results showed that Scopus’s citation analysis metrics function as moderate pre-
dictors of L2 journal quality, accounting for around 11–17% of the observed variance in
journals’ statistical quality, thus providing no evidence in favor of the newly introduced
CiteScore over the other metrics, at least in our field. Another indicator of journal quality
is whether the journal is SSCI-indexed. SSCI’s JCR was not a significant predictor of
journal quality, probably due to the small variation among SSCI-indexed journals, most
of which show high quality in the L2 field. The analysis also revealed a number of
prevalent statistical violations that were surveyed in this article. Future research should
investigate other aspects of journal quality (i.e. other than statistical) to examine their
relationship with journal indexing and citation analysis metrics.
The present study is not without limitations. Our sample of 30 journals was rather
small. However, we were limited by the available number of L2 journals that are indexed
by Scopus and SSCI. For this reason, we did not have the luxury to conduct a power
analysis and then obtain a sufficiently large sample. Nevertheless, our study is still one
of the largest quantitative surveys of L2 journals in the field to date. A further limitation
is whether selecting five articles from a journal would be truly representative of that
journal. In our case, in addition to aiming to investigate the most recent quantitative
trends in journals, we were also bound by practical constraints. A total of 150 journal
articles to read and analyse is no easy feat. Rather than recommending that future research-
ers use a larger sample than ours, an alternative approach is to conduct compara-
ble studies on more recent literature and then combine the results meta-analytically. This
would help build a cumulative science of journal quality in the field.
Funding
This research received no specific grant from any funding agency in the public, commercial, or
not-for-profit sectors.
Note
1. We do not claim that other journals not listed in Table 5 necessarily have lower quality
because our sample was not exhaustive of journals in the field and because it included only
five recent articles from each journal. In fact, even for journals listed in Table 5, we do not
recommend that researchers interested in improving their statistical literacy browse older issues
of these journals, since quality (and editorial policies) change over time.
ORCID iDs
Ali H. Al-Hoorie https://orcid.org/0000-0003-3810-5978
Joseph P. Vitta https://orcid.org/0000-0002-5711-969X
References
Benson, P., Chik, A., Gao, X., Huang, J., & Wang, W. (2009). Qualitative research in language
teaching and learning journals, 1997–2006. The Modern Language Journal, 93, 79–90.
Bolstad, W.M. (2007). Introduction to Bayesian Statistics. 2nd edition. Hoboken, NJ: Wiley.
Brown, J.D. (1990). The use of multiple t tests in language research. TESOL Quarterly, 24, 770–773.
Brown, J.D. (2004). Resources on quantitative/statistical research for applied linguists. Second
Language Research, 20, 372–393.
Brumback, R.A. (2009). Impact factor wars: Episode V: The empire strikes back. Journal of Child
Neurology, 24, 260–262.
Chadegani, A.A., Salehi, H., Yunus, M.M., et al. (2013). A comparison between two main aca-
demic literature collections: Web of Science and Scopus databases. Asian Social Science, 9,
18–26.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences. 3rd edition. Mahwah, NJ: Lawrence Erlbaum.
Colledge, L., de Moya-Anegón, F., Guerrero-Bote, V., et al. (2010). SJR and SNIP: Two new
journal metrics in Elsevier’s Scopus. Serials, 23, 215–221.
da Silva, J.A.T., & Memon, A.R. (2017). CiteScore: A cite for sore eyes, or a valuable, transparent
metric? Scientometrics, 111, 553–556.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in
Psychology, 5. Available online at http://doi.org/10.3389/fpsyg.2014.00781 (accessed March
2018).
Egbert, J.O.Y. (2007). Quality analysis of journals in TESOL and applied linguistics. TESOL
Quarterly, 41, 157–171.
Garfield, E. (2006). The history and meaning of the journal impact factor. JAMA, 295, 90–93.
Gelman, A., & Stern, H.S. (2006). The difference between ‘significant’ and ‘not significant’ is not
itself statistically significant. The American Statistician, 60, 328–331.
Green, S.B., Lissitz, R.W., & Mulaik, S.A. (1977). Limitations of coefficient alpha as an index of
test unidimensionality. Educational and Psychological Measurement, 37, 827–838.
Guerrero-Bote, V.P., & Moya-Anegón, F. (2012). A further step forward in measuring journals’
scientific prestige: The SJR2 indicator. Journal of Informetrics, 6, 674–688.
Guz, A.N., & Rushchitsky, J.J. (2009). Scopus: A system for the evaluation of scientific journals.
International Applied Mechanics, 45, 351–362.
Hoekstra, R., Morey, R.D., Rouder, J.N., & Wagenmakers, E.-J. (2014). Robust misinterpretation
of confidence intervals. Psychonomic Bulletin & Review, 21, 1157–1164.
Hong-Nam, K., & Leavell, A.G. (2006). Language learning strategy use of ESL students in an
intensive English learning context. System, 34, 399–415.
Hox, J.J. (2010). Multilevel analysis: Techniques and applications. 2nd edition. New York:
Routledge.
Jung, U.O.H. (2004). Paris in London revisited or the foreign language teacher’s top-most jour-
nals. System, 32, 357–361.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practi-
cal primer for t-tests and ANOVAs. Frontiers in Psychology, 4. Available online at http://doi.
org/10.3389/fpsyg.2013.00863 (accessed March 2018).
Larson-Hall, J. (2016). A guide to doing statistics in second language research using SPSS and R.
2nd edition. New York: Routledge.
Larson-Hall, J. (2017). Moving beyond the bar plot and the line graph to create informative and
attractive graphics. The Modern Language Journal, 101, 244–270.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research findings:
What gets reported and recommendations for the field. Language Learning, 65, 127–159.
Leydesdorff, L., & Opthof, T. (2010). Scopus’s source normalized impact per paper (SNIP) versus
a journal impact factor based on fractional counting of citations. Journal of the American
Society for Information Science and Technology, 61, 2365–2369.
Lindstromberg, S. (2016). Inferential statistics in Language Teaching Research: A review and
ways forward. Language Teaching Research, 20, 741–768.
Loewen, S., & Gass, S. (2009). The use of statistics in L2 acquisition research. Language Teaching,
42, 181–196.
Loewen, S., Crowther, D., Isbell, D., Lim, J., Maloney, J., & Tigchelaar, M. (2017). The statisti-
cal literacy of applied linguistics researchers. Unpublished paper presented at the American
Association for Applied Linguistics (AAAL), Portland, Oregon, USA.
Loewen, S., Lavolette, E., Spino, L.A., et al. (2014). Statistical literacy among applied linguists
and second language acquisition researchers. TESOL Quarterly, 48, 360–388.
Ludbrook, J. (1998). Multiple comparison procedures updated. Clinical and Experimental
Pharmacology and Physiology, 25, 1032–1037.
Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., & Wagenmakers, E.-J. (2016). The fallacy
of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23, 103–123.
Nassaji, H. (2012). Statistical significance tests and result generalizability: Issues, misconceptions,
and a case for replication. In G.K. Porte (Ed.), Replication research in applied linguistics (pp.
92–115). Cambridge: Cambridge University Press.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of prob-
ability. Philosophical transactions of the Royal Society of London, Series A: Mathematical
and Physical Sciences, 236, 333–380.
Norris, J.M. (2015). Statistical significance testing in second language research: Basic problems
and suggestions for reform. Language Learning, 65, 97–126.
Norris, J.M., Plonsky, L., Ross, S.J., & Schoonen, R. (2015). Guidelines for reporting quantitative
methods and results in primary research. Language Learning, 65, 470–476.
Oxford, R.L. (1990). Language learning strategies: What every teacher should know. New York:
Newbury.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting prac-
tices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodological
synthesis and call for reform. The Modern Language Journal, 98, 450–470.
Plonsky, L. (Ed.) (2015). Advancing quantitative methods in second language research. New
York: Routledge.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The
case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Oswald, F.L. (2014). How big Is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Radwan, A.A. (2011). Effects of L2 proficiency and gender on choice of language learning strate-
gies by university students majoring in English. The Asian EFL Journal, 13, 115–163.
Rayson, P., Berridge, D., & Francis, B. (2004). Extending the Cochran rule for the comparison
of word frequencies between corpora. Unpublished paper presented at the 7th International
Conference on Statistical analysis of textual data (JADT 2004), Louvain-la-Neuve, Belgium.
Rothstein, H.R., Sutton, A.J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-analysis:
Prevention, assessment and adjustments. Chichester: Wiley.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha.
Psychometrika, 74, 107–120.
Stevens, J. (2009). Applied multivariate statistics for the social sciences. 5th edition. New York:
Routledge.
Tabachnick, B.G., & Fidell, L.S. (2013). Using multivariate statistics. 6th edition. Boston, MA:
Pearson.
Tseng, W.-T., Dörnyei, Z., & Schmitt, N. (2006). A new approach to assessing strategic learning:
The case of self-regulation in vocabulary Acquisition. Applied Linguistics, 27, 78–102.
Vitta, J.P., & Al-Hoorie, A.H. (2017). Scopus- and SSCI-indexed L2 journals: A list for the Asia
TEFL community. The Journal of Asia TEFL, 14, 784-792.
Weiner, G. (2001). The academic journal: Has it a future? Education Policy Analysis Archives, 9.
Available online at http://doi.org/10.14507/epaa.v9n9.2001 (accessed March 2018).
Author biographies
Ali H. Al-Hoorie is assistant professor at the English Language Institute, Jubail Industrial College,
Saudi Arabia. He completed his PhD degree at the University of Nottingham, UK, under the super-
vision of Professors Zoltán Dörnyei and Norbert Schmitt. He also holds an MA in Social Science
Data Analysis from Essex University, UK. His research interests include motivation theory,
research methodology, and complexity.
Joseph P. Vitta is active in TESOL/Applied Linguistics research with interests and publications in
lexis, curriculum design, research methods, and computer-assisted language learning. As an ELT
professional, he has over 12 years’ experience as both a program manager and language teacher.
Appendix 1
Keywords
CALL, computer assisted language learning, EAP, English for academic purposes, EFL, English
as a foreign language, ELL, English language learner, ELT, English language teaching, ESP,
English for specific purposes, FLA, Foreign language acquisition, foreign language, language
acquisition, language assessment, language testing, language classroom, language curriculum, lan-
guage education, language educator, language learning, language learner, language learners, lan-
guage proficiency, language teaching, language teacher, language teachers, second language,
SLA, second language acquisition, TEFL, teaching English as a foreign language, TESL, teaching
English as a second language, TESOL, teaching English to speakers of other languages, teaching
English.
Appendix 2
Journals
1. Applied Linguistics
2. Asian EFL Journal
3. Asian ESP Journal
4. CALL-EJ
5. Computer Assisted Language Learning
6. Electronic Journal of Foreign Language Teaching
7. ELT Journal
8. English for Specific Purposes
9. Foreign Language Annals
10. Indonesian Journal of AL
11. Innovation in Language Learning and Teaching
12. International Review of Applied Linguistics in Language Teaching
13. Iranian Journal of Language Teaching Research
14. Journal of Asia TEFL
15. Journal of English for Academic Purposes
16. Journal of Second Language Writing
17. Language Assessment Quarterly
18. Language Learning
19. Language Learning & Technology
20. Language Learning Journal
21. Language Teaching Research
22. Language Testing
23. Modern Language Journal
24. ReCALL
25. Second Language Research
26. Studies in Second Language Acquisition
27. System
28. Teaching English with Technology
29. TESOL Quarterly
30. JALT CALL Journal