Language Teaching Research
2019, Vol. 23(6) 727–744
© The Author(s) 2018
The seven sins of L2 research:
A review of 30 journals’
statistical quality and their
CiteScore, SJR, SNIP, JCR
Ali H. Al-Hoorie
Jubail Industrial College, Saudi Arabia
Joseph P. Vitta
Queen’s University Belfast, UK; Rikkyo University – College of Intercultural Communication, Japan
Abstract

This report presents a review of the statistical practices of 30 journals representative of the
second language field. A review of 150 articles showed a number of prevalent statistical violations
including incomplete reporting of reliability, validity, non-significant results, effect sizes, and
assumption checks as well as making inferences from descriptive statistics and failing to correct
for multiple comparisons. Scopus citation analysis metrics and whether a journal is SSCI-indexed
were predictors of journal statistical quality. No clear evidence was obtained to favor the newly
introduced CiteScore over SNIP or SJR. Implications of the results are discussed.
Keywords
citation analysis metrics, CiteScore, JCR Impact Factor, journal quality, quantitative research,
second language, SJR, SNIP
Second language (L2) researchers have long been interested in improving the quantita-
tive rigor in the field. In an early study, Brown (1990) pointed out the need to improve
quantitative quality in the field, singling out the importance of using ANOVA as opposed
Corresponding author: Joseph P. Vitta, Queen’s University Belfast, University Road, Belfast BT7 1NN, UK.
to multiple t-tests. This recommendation may now be common knowledge to many
researchers, suggesting that our field has made substantial progress in statistical
practices over the decades (see Plonsky, 2014). The field has now moved to relatively
more advanced topics, including the need for a priori power calculation and assumption
checking (e.g. Larson-Hall, 2016; Norris et al., 2015; Plonsky, 2015). The analysis of
quantitative practices in the field is currently an active area of research, as ‘there is no
controversy over the necessity of rigorous quantitative methods to advance the field of
SLA’ (Plonsky, 2013, p. 656).
In addition to the goal of improving quantitative quality, there has also been an inter-
est in overall journal quality, both within the L2 field and in academia in general (see
Egbert, 2007; Garfield, 2006; Vitta & Al-Hoorie, 2017). Indexing has often been seen as
a key factor in journal evaluation, with Scopus and the Web of Science being the two
indexes of prestige (Chadegani et al., 2013; Guz & Rushchitsky, 2009). Within these two
catalogues, citation analysis metrics have been employed as efficient measurements of
overall journal quality for some time. Garfield (2006), for example, noted that the Web
of Science’s Impact Factor has been in use since the 1950s as the index’s citation analysis
metric. At the same time, indexing and citation analysis have not universally been
accepted as a definitive way to assess journal quality within a field (e.g. Brumback,
2009; Egbert, 2007). This is primarily due to the notion that citation quantifies only one
aspect in the overall evaluation of a journal and might miss other, perhaps equally impor-
tant, aspects of journal quality.
Based on similar considerations, Plonsky (2013) has suggested the need for investi-
gating the relationship between statistical practices and journal quality in the L2 field. To
date, such an investigation does not appear to have been performed. The current study
therefore aimed to address this gap. A total of 150 quantitative articles from 30 L2 jour-
nals were assessed for quantitative rigor. The relationship between each journal’s quan-
titative quality and a number of popular journal quality measurements relating to citation
analysis and indexing were then examined.
1 Quantitative rigor
In recognition of the importance of quantitative knowledge in the second language (L2)
field, a number of researchers have recently investigated various aspects related to meth-
odological and statistical competence of L2 researchers. Loewen and colleagues (2014),
for instance, found that only 30% of professors in the field report satisfaction with their
level of statistical training, while only 14% of doctoral students do so. Loewen and col-
leagues (2017) subsequently extended this line of research by using independent meas-
ures of quantitative competence, rather than simply relying on self-reported knowledge,
and also found significant gaps in the statistical literacy of L2 researchers. Furthermore,
quality of research design does not seem to be a priority for some scholars in the field when
evaluating the status of different journals (Egbert, 2007).
Inadequate quantitative knowledge is likely to be evident in the field’s publications.
In one of the first empirical assessments of the methodological quality in published
L2 research, Plonsky and Gass (2011) investigated quantitative reports in the L2 interac-
tion tradition spanning a period of around 30 years. They observed that ‘weaknesses in
the aggregate findings appear to outnumber the strengths’ (p. 349) and speculated
whether this is in part due to inadequate training of researchers. Subsequently, Plonsky
(2013, 2014) investigated articles published in two major journals in the field (Language
Learning and Studies in Second Language Acquisition). In line with previous findings,
these studies identified a number of limitations prevalent in the field, such as lack of
control in experimental designs and incomplete reporting of results. In a more recent
study, Lindstromberg (2016) examined articles published in one journal (Language
Teaching Research) over a period of about 20 years. This study also found issues similar
to those obtained in previous studies, such as incomplete reporting and overemphasizing
significance testing over effect sizes.
2 Unpacking the ‘sins’ of quantitative research
A large number of topics fall under quantitative research, making a parsimonious clas-
sification of these topics no easy task (see Stevens, 2009; Tabachnick & Fidell, 2013). In
this article, we adopt an approach similar to that used by Brown (2004), where statistical
topics are classified into three broad areas: psychometrics, inferential testing, and assumption checking.

Psychometrics – subsuming reliability and validity – has been of major concern to
past research into the quantitative rigor in the field. Larson-Hall and Plonsky (2015)
pointed out that the most commonly used reliability coefficients in the field are
Cronbach’s α (for internal consistency) and Cohen’s κ (interrater reliability). Norris
et al. (2015) echoed the call for researchers to address reliability while also arguing
that researchers provide ‘validity evidence’ (p. 472). Validity evidence is of equal
importance to reliability, though validity is trickier since it lacks a straightforward
numeric value that researchers can report. For this reason, Norris et al. (2015) sug-
gested the use of conceptual validity evidence, such as pilot studies and linking instru-
ments to past research. Empirical validity evidence can also be utilized, such as factor
analysis for construct validity (Tabachnick & Fidell, 2013). In L2 research, reliability
has usually been emphasized, while validity considerations have been somewhat over-
looked. For example, studies using the Strategy Inventory for Language Learning
(Oxford, 1990) often reported internal reliability (see Hong-Nam & Leavell, 2006;
Radwan, 2011), despite calls for the need to also consider validity evidence for this
instrument (Tseng, Dörnyei, & Schmitt, 2006).
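Reliability coefficients such as Cronbach’s α are straightforward to compute from raw item scores. As a minimal illustration (the item responses and helper function below are invented for demonstration only):

```python
# Illustrative computation of Cronbach's alpha for a short Likert scale.
# The item responses below are invented for demonstration only.

def cronbach_alpha(items):
    """items: list of per-item score lists, all of equal length (one entry per respondent)."""
    k = len(items)                      # number of items
    n = len(items[0])                   # number of respondents

    def variance(xs):                   # sample variance (n - 1 denominator)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = sum(variance(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]  # per-person total score
    return (k / (k - 1)) * (1 - item_vars / variance(totals))

# Five respondents answering a three-item scale (hypothetical data):
item1 = [4, 5, 3, 4, 5]
item2 = [4, 4, 3, 5, 5]
item3 = [5, 5, 2, 4, 4]

alpha = cronbach_alpha([item1, item2, item3])
print(f"Cronbach's alpha = {alpha:.2f}")  # Cronbach's alpha = 0.81
```

Note that, as discussed below, a high α alone does not establish unidimensionality; it is one piece of psychometric evidence among several.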
When it comes to inferential testing, a number of issues have been pointed out in
previous methodological research. One of these issues is, obviously, the need to use
inferential statistics. Descriptive statistics alone can sometimes be informative (Larson-
Hall, 2017), but for most purposes it is not clear whether an observed trend is merely a
natural fluctuation that should not be overinterpreted. Inferential statistics, by definition,
permits the researcher to generalize the observed trend from the sample at hand to the
population. In their investigation of interaction research, Plonsky and Gass (2011) report
that around 6% of the studies did not perform any inferential tests. Similarly, Plonsky (2013) reports in
a subsequent study that 12% of the studies in the sample did not use inferential testing.
A second issue concerning inferential testing has to do with complete reporting of the
results (Larson-Hall & Plonsky, 2015). A number of methodologists (e.g. Brown, 2004;
Nassaji, 2012; Norris et al., 2015; Plonsky, 2013; 2014, among others) have emphasized
that inferential tests must have all their relevant information presented for transparency
and replicability. In the case of t-tests, for example, readers would need information
regarding means, standard deviations, degrees of freedom, t-value, and p-value.
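As a sketch of what such complete reporting involves, the following hypothetical example (invented scores for two groups; scipy assumed available) gathers all of these components rather than reporting only a p-value:

```python
# A minimal sketch of complete t-test reporting (invented scores for two groups).
from statistics import mean, stdev
from scipy import stats

group_a = [72, 75, 68, 80, 77, 74, 71, 79]
group_b = [65, 70, 62, 68, 66, 71, 64, 69]

t, p = stats.ttest_ind(group_a, group_b)      # Student's t-test (equal variances)
df = len(group_a) + len(group_b) - 2

# Report means, SDs, degrees of freedom, t, and p together, not just "p < .05":
print(f"Group A: M = {mean(group_a):.2f}, SD = {stdev(group_a):.2f}")
print(f"Group B: M = {mean(group_b):.2f}, SD = {stdev(group_b):.2f}")
print(f"t({df}) = {t:.2f}, p = {p:.4f}")
```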
Effect sizes, which describe the magnitude of the relationship among variables, are a
further required aspect of quantitative rigor. In arguing for the centrality of effect sizes,
Norris (2015) asserted that L2 researchers have a tendency to conflate statistical signifi-
cance with practical significance. In the same vein, Nassaji (2012) posited that our field
has yet to firmly grasp that p-values only speak to Type I error probability and not the
strength of association between dependent and independent variables (see Cohen, Cohen,
West, & Aiken, 2003). Effect size reporting has therefore been stressed by L2 method-
ologists in recent years (e.g. Larson-Hall & Plonsky, 2015; Plonsky, 2013, 2014; Plonsky
& Gass, 2011; Plonsky & Oswald, 2014).
In addition to the above, a common situation in inferential testing is when researchers
perform several tests. Brown (1990) suggests that researchers should employ ANOVA as
an alternative to multiple t-tests in order to control Type I error rate. Norris and colleagues
(Norris, 2015; Norris et al., 2015) also highlight the need to perform a correction for alpha
level in multiple comparisons. A common procedure is the Bonferroni correction, where the
alpha level is divided by the number of tests performed. As an illustration, if 10 t-tests are
performed simultaneously in one study, the alpha level becomes .05/10 = .005. In this
example, a result is considered significant only if the p-value is less than .005. Procedures
that are less conservative than the Bonferroni correction have also been proposed (e.g.
Holm–Bonferroni and Benjamini–Hochberg; see Larson-Hall, 2016; Ludbrook, 1998).
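The arithmetic of these corrections is simple enough to sketch directly. The p-values below are invented; the Holm–Bonferroni step-down illustrates how it can retain results that the plain Bonferroni threshold would discard:

```python
# Sketch of alpha adjustment for multiple comparisons (p-values are invented).
p_values = [0.004, 0.015, 0.020, 0.30]
alpha = 0.05
m = len(p_values)

# Bonferroni: test every p against alpha / m (here .05 / 4 = .0125).
bonferroni = [p < alpha / m for p in p_values]

# Holm-Bonferroni: step down through the sorted p-values, comparing the
# k-th smallest against alpha / (m - k), stopping at the first failure.
holm = [False] * m
for rank, (i, p) in enumerate(sorted(enumerate(p_values), key=lambda x: x[1])):
    if p < alpha / (m - rank):
        holm[i] = True
    else:
        break

print(bonferroni)  # [True, False, False, False]
print(holm)        # [True, True, True, False]
```

Here Bonferroni flags only the smallest p-value, while Holm additionally retains .015 and .020, illustrating why the step-down procedure is described as less conservative.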
Finally, an essential consideration in quantitative research is checking that the necessary
assumptions are satisfied. Although it is typically classified under inferential statistics (i.e.
to determine whether parametric tests are appropriate), this point is placed in a separate
category in the present article for two reasons. First, checking assumptions is not limited to
inferential testing but also applies to descriptive statistics. For example, reporting the mean
and standard deviation assumes that the data are normally distributed. Otherwise, the mean
and standard deviation would not be representative of the central tendency and dispersion
of the data, respectively. Psychometric measures also have assumptions that need to be satisfied. For
example, Cronbach’s alpha assumes that the construct is unidimensional (Green, Lissitz, &
Mulaik, 1977), or else its value could be inflated. Second, assumption checking seems
consistently overlooked in L2 research, despite repeated calls emphasizing its importance
(e.g. Lindstromberg, 2016; Loewen & Gass, 2009; Nassaji, 2012; Norris, 2015; Norris
et al., 2015; Plonsky, 2014; Plonsky & Gass, 2011). In the present study, the violations
reviewed above are called the seven ‘sins’ of quantitative research (see Table 1).
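As an illustration of what routine assumption checking can look like in practice, the following sketch (simulated scores; scipy assumed available) runs a Shapiro–Wilk test for normality within each group and Levene’s test for homogeneity of variance before any parametric comparison:

```python
# Sketch of two routine assumption checks before a parametric comparison
# (the scores below are simulated for illustration).
import random
from scipy import stats

random.seed(1)
group_a = [random.gauss(70, 5) for _ in range(30)]
group_b = [random.gauss(73, 5) for _ in range(30)]

# Normality within each group (Shapiro-Wilk):
for name, g in [("A", group_a), ("B", group_b)]:
    w, p = stats.shapiro(g)
    print(f"Group {name}: W = {w:.3f}, p = {p:.3f}")  # p > .05: no evidence of non-normality

# Homogeneity of variance across groups (Levene's test):
stat, p = stats.levene(group_a, group_b)
print(f"Levene: W = {stat:.3f}, p = {p:.3f}")
```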
III Journal quality
Discussion of the methodological rigor of research articles ultimately speaks to the qual-
ity of the field’s journals. Assessment of journal quality is of particular importance
because of the value different stakeholders in both academic and professional arenas
place on research found in them (Weiner, 2001). Egbert (2007) surveyed multiple ways
to gauge journal quality in the L2 field, such as citation analysis, rejection rate, time to
publication, and expert opinion.
In reality, however, citation analysis has been one of the most common means of
evaluating different journals, probably because it offers a simple numeric impact value
to rank each journal (see Brumback, 2009; Colledge et al., 2010; Leydesdorff & Opthof,
2010). The history of citation analysis metrics dates back to the 1950s (see Garfield,
2006). Academia, in general, has been eager to embrace a means to gauge journal quality
empirically (Weiner, 2001). Currently, the most commonly used citation analysis metrics
are Source Normalized Impact per Paper (SNIP), SCImago Journal Rank (SJR), and
Journal Citation Reports (JCR) Impact Factor. The former two are maintained by
Elsevier’s Scopus, while JCR is maintained by Clarivate’s Web of Science (WOS; for-
merly Thompson-Reuters). WOS also maintains the Social Sciences Citation Index
(SSCI), which is the most relevant to our field.
Intense competition exists between these two indexing services, resulting in continuous
improvement of their metrics (see Chadegani et al., 2013; Guz & Rushchitsky, 2009). As
part of this development, Scopus has recently unveiled a new metric called CiteScore (see
da Silva & Memon, 2017), which is calculated in a similar way to JCR except that the look-
back period is three years rather than two. Table 2 presents an overview of these metrics.
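The difference between the two look-back windows can be illustrated with hypothetical counts. The figures below are invented, and the actual CiteScore and JCR calculations involve further rules about which items count as citable:

```python
# Hypothetical citation and publication counts for one journal, used only to
# illustrate the different look-back windows of CiteScore and the JCR Impact Factor.
citations = {2014: 120, 2015: 150, 2016: 180}   # citations in 2017 to items from each year
papers    = {2014: 60,  2015: 70,  2016: 80}    # citable items published in each year

# JCR-style impact factor: two-year window (2015-2016).
jcr = (citations[2015] + citations[2016]) / (papers[2015] + papers[2016])

# CiteScore-style metric: three-year window (2014-2016).
citescore = sum(citations.values()) / sum(papers.values())

print(f"2-year IF  = {jcr:.2f}")        # 330 / 150 = 2.20
print(f"CiteScore ~= {citescore:.2f}")  # 450 / 210 = 2.14
```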
Metrics judging journal quality through citation analysis have come under heavy criti-
cism (Brumback, 2009). Some have expressed doubt about the viability of using one
metric to assess the various dimensions contributing to a journal’s quality (Egbert, 2007).
Some of these metrics are proprietary and their actual calculations are not made public,
making them unverifiable. In fact, there have been reports of editors and publishers
negotiating and manipulating these metrics to improve journal rankings (see Brumback,
2009). Nevertheless, citation analysis metrics remain the primary means of assessing
journal quality, thus governing employment, tenure, funding, and livelihood for many
researchers around the world.
In the L2 field, there have been attempts to evaluate the different journals available.
In one study, Jung (2004) aimed to rank the prestige of a number of journals. Jung’s
primary criteria were being teacher-oriented and being indexed by the prestigious Centre
for Information on Language Teaching and Research, then hosted in the Language Teaching
journal (n = 12).

Table 1. The seven sins of quantitative research.

Psychometrics           1. Not reporting reliability
                        2. Not discussing validity
Inferential statistics  3. Making inferences from descriptive statistics
                        4. Incomplete reporting, including non-significant results
                        5. Not reporting effect sizes
                        6. Not adjusting for multiple comparisons
Other                   7. Not reporting assumption checks

In another study, Egbert (2007) created a list of field-specific journals
primarily based on expert opinion (n = 35). Benson, Chik, Gao, Huang, and Wang (2009)
also employed a list of journals (n = 10) for the purpose of evaluating these journals in
relation to qualitative rigor. In all of these studies, journals’ citation analysis values have
not been systematically taken into consideration in evaluating these journals.
IV The present study
As reviewed above, previous research on quantitative quality in the L2 field has tended
to focus on specific publication venues (e.g. one or two particular journals) or specific
research traditions (e.g. L2 interaction). Studies with a wider scope, on the other hand,
have not focused on either quantitative quality or its relation to popular journal citation
analysis. In this study, we aimed to conduct a broader review covering a representative
sample of L2 journals. This would allow us to obtain a more general overview of study
quality in the L2 field, as well as to empirically assess the utility of the four citation
analysis metrics as a representation of quality of different journals.
Because of this wide coverage, we also narrowed our scope to focus specifically on
statistical issues rather than more general methodological issues. For example, prior
research has repeatedly shown that many L2 researchers usually overlook important con-
siderations such as power analysis and using a control group (e.g. Lindstromberg, 2016;
Plonsky, 2013, 2014). However, it must be noted that such considerations are design
issues that need to be addressed before conducting the study. Thus, having an adequate
sample size or a control group can sometimes be governed by practical and logistical
considerations that are outside the researcher’s control. Because of the wide coverage of
our sample of journals, involving various L2 sub-disciplines where practical limitations
Table 2. Four common journal citation analysis metrics and their characteristics.

SNIP (Scopus): Number of citations to a journal’s articles in the past three
years divided by the total number of its articles in the past three years.
Normalized to facilitate cross-discipline comparisons (Colledge et al., 2010).

SJR (Scopus): Essentially SNIP calculations that are additionally weighted,
depending on the rank of the citing journal, while excluding self-citations.
The weighting uses the PageRank algorithm (Guerrero-Bote & Moya-Anegón, 2012).

CiteScore (Scopus): Total number of a journal’s citations in a given year
divided by the journal’s total number of citable publications during the
past three years (da Silva & Memon, 2017).

JCR (WOS): Total number of a journal’s citations in a given year divided by
the journal’s total number of citable publications during the past two years
(Garfield, 2006).

Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; WOS = Web of Science.
may be an essential part of everyday research, we limited our review to data analysis
issues specifically. We aimed to answer three research questions:
1. What are the most common statistical violations found in L2 journals?
2. What is the relationship between the journal’s statistical quality and its citation
analysis scores (SNIP, SJR, CiteScore, and JCR)?
3. What is the relationship between the journal’s statistical quality and its indexing
(SSCI vs. Scopus)?
1 Inclusion criteria
In order to be included in this study, the journal had to satisfy the following criteria:
1. The journal is indexed by Scopus or SSCI.
2. The journal is within the second/foreign language learning and teaching area.
3. The journal presents original, quantitative research.
4. The journal uses English as its primary medium.
2 Journal search
The titles of journals indexed by Scopus and SSCI were examined against a list of
keywords representing various interests in the L2 field. Three experts were consulted
over two iterations to develop and validate the list of keywords (for the complete list,
see Appendix 1). A few well-known journals were not captured by our keywords (e.g.
System) and these were subsequently added. The final list of journals satisfying all
inclusion criteria included 30 journals (for the complete list, see Appendix 2). All jour-
nals were indexed in Scopus but only 19 were additionally indexed in SSCI. Our sam-
ple of 30 journals was larger than previous samples by Jung (2004, n = 12) and by
Benson et al. (2009, n = 10). It was slightly smaller than that by Egbert (2007, n = 35),
but this was primarily because our sample was limited to journals indexed in either
Scopus or SSCI. Therefore, it seems reasonable to argue that our sample is representa-
tive of journals in the L2 field.
Table 3 presents an overview of the journals in our sample and their citation analysis
scores. Both the Kolmogorov–Smirnov and Shapiro–Wilk tests for normality showed
that the four impact factors were normally distributed, ps > .05.
3 Data analysis
Five recent quantitative articles were randomly extracted from each of the 30 journals,
resulting in a total of 150 articles. All articles were published in 2016 or later, thus mak-
ing them representative of the latest quantitative trends in these journals. While a five-
article sample may not be fully representative of trends in a journal over time, our aim
was to investigate quantitative practices found in the most recent publications in the L2
field. This would allow us to find out the most common statistical violations in recent
literature. We discuss this issue further in the limitations of this study.
Each article was subsequently reviewed by two researchers independently against the
violations of statistical conventions described in Table 1. Controversial topics were
avoided, such as the adequacy of classical test theory vs. item response theory, or explor-
atory factor analysis vs. principal components analysis. In coding violations in each arti-
cle, repeated violations of the same issue were coded as one violation only, so violations
were coded in a binary format (present vs. not present) for each category. Interrater reli-
ability was high (96.7% agreement, κ = .91), and all discrepancies were resolved through
discussion until 100% agreement was reached.
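For readers unfamiliar with these agreement statistics, percent agreement and Cohen’s κ can be sketched from two raters’ binary codes (the codes below are invented for illustration):

```python
# Sketch of interrater agreement for binary violation coding (codes are invented).
rater1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]

n = len(rater1)
observed = sum(a == b for a, b in zip(rater1, rater2)) / n  # raw percent agreement

# Chance agreement from each rater's marginal proportions:
p1 = sum(rater1) / n          # proportion of "violation present" for rater 1
p2 = sum(rater2) / n
expected = p1 * p2 + (1 - p1) * (1 - p2)

# Kappa corrects the observed agreement for agreement expected by chance:
kappa = (observed - expected) / (1 - expected)
print(f"Agreement = {observed:.0%}, kappa = {kappa:.2f}")  # Agreement = 90%, kappa = 0.80
```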
Violations were first averaged within each journal, and then the overall mean and stand-
ard deviation were computed (see Table 4). The table also reports the one-sample t-test
that examined whether the observed means were significantly different from zero. All
results were significant and lower than the Bonferroni-adjusted significance level (.05/7
= .007). The effect sizes were also generally substantial.
Since the maximum possible mean was 1.0 for each violation following our binary
coding, the means in the table can also be interpreted as a probability. As an illustration,
the probability of an article having issues with reliability is 24.7% (see Table 4). In other
words, almost one in every four articles would have a reliability issue. On the other hand,
almost every other article makes inferences from descriptive statistics (44%). Similarly,
just over one in three articles does not report effect sizes (38.7%).
Since the normality and linearity assumptions were satisfied, the correlations between
journals’ statistical quality and their citation analysis scores were examined to shed further
light on these results. There was a positive correlation between the statistical quality (vio-
lations were coded here so that a higher value indicated higher quality) of the 30 journals
and their Scopus citation analysis metrics (r(SJR) = .414, p = .023; r(SNIP) = .339, p = .067;
r(CiteScore) = .344, p = .062). In other words, SJR accounted for 17.1% of the variance
observed in journals’ statistical quality, while SNIP and CiteScore accounted for 11.5%
and 11.8% of the observed variance, respectively. The JCR impact factor was non-significant
for the 19 journals indexed by SSCI, r = –.129, p = .599, accounting for a negligible 1.7%
of the variance.
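The variance-explained figures follow directly from squaring the reported coefficients, as a quick check shows:

```python
# Squaring a correlation coefficient gives the proportion of shared variance.
r_sjr = 0.414    # SJR vs. statistical quality (reported above)
r_jcr = -0.129   # JCR vs. statistical quality (reported above)

print(f"SJR: {r_sjr ** 2:.1%} of variance")  # 17.1%
print(f"JCR: {r_jcr ** 2:.1%} of variance")  # 1.7%
```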
Table 3. Number of journals included and means and standard deviations of their 2015 citation analysis scores.

Index    n    Impact factor    M (SD)
Scopus   30   SNIP             1.24 (0.78)
              SJR              0.98 (0.77)
              CiteScore        1.17 (0.84)
SSCI     19   JCR              1.42 (0.60)

Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; SSCI = Social Sciences Citation Index.
Although the correlation between JCR and journal statistical quality was non-signifi-
cant, an independent samples t-test showed that non-SSCI-indexed journals had signifi-
cantly more violations (M = 11.27, SD = 3.35, n = 11) than SSCI-indexed ones (M = 7.89,
SD = 3.59, n = 19), t(28) = 2.54, p = .017, d = 0.97. These results suggest that SSCI-
indexed journals demonstrate higher quantitative rigor. Table 5 lists L2 journals indexed
by both Scopus and SSCI that demonstrated the fewest violations.1
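These summary statistics suffice to reconstruct the test; the sketch below recomputes t and Cohen’s d from the reported means, standard deviations, and group sizes (small rounding differences from the published d are expected, since the inputs are themselves rounded):

```python
# Reproducing the SSCI comparison from the summary statistics reported above.
from math import sqrt

m1, sd1, n1 = 11.27, 3.35, 11   # non-SSCI journals (violation counts)
m2, sd2, n2 = 7.89, 3.59, 19    # SSCI journals

# Pooled standard deviation across the two groups:
sp = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))

t = (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))   # independent-samples t
d = (m1 - m2) / sp                             # Cohen's d

print(f"t({n1 + n2 - 2}) = {t:.2f}, d = {d:.2f}")  # t(28) = 2.54, d = 0.96
```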
This article has presented the results of an analysis of 150 articles derived from a list of 30
journals representative of the L2 field. A number of statistical violations were observed
with varying degrees of frequency. The results also showed that Scopus citation analysis
metrics represent moderate predictors of statistical quality (accounting for around 11–
17% of the variance), with no evidence to favor the newly introduced CiteScore over
SNIP or SJR. Although these metrics account for less than 20% of the variance in the
observed quality of L2 journals, this magnitude might be considered reasonable, consider-
ing that statistical quality is only one dimension factoring in the overall quality of a jour-
nal. Other dimensions include non-statistical components of quantitative articles as well
as other types of articles such as qualitative and conceptual ones. Indeed, ‘a single method,
regardless of the number of components included, could not account for important differ-
ences among journals and in reasons for publishing in them’ (Egbert, 2007, p. 157).
The results also show that JCR was not a significant predictor of journal statistical
quality. This finding does not necessarily imply that this metric is not useful. It may mean
that L2 journals indexed by SSCI do not demonstrate sufficient variation for JCR to cap-
ture it. To become indexed by SSCI is typically a long process in which journals are
expected to demonstrate a certain level of quality. Indeed, the results above showed that
journals indexed in SSCI exhibited fewer statistical violations than non-SSCI journals.
Overall, these results suggest two ways to evaluate L2 journal quality: 1) the journal’s SJR
Table 4. Prevalence of the seven violations emerging from the analysis.

Theme                                                        M      SD    t     p        d
1. Not reporting reliability                                 0.247  0.23  5.81  < .0001  1.06
2. Not discussing validity                                   0.087  0.13  3.81  .00066   0.70
3. Making inferences from descriptive statistics             0.440  0.29  8.20  < .0001  1.50
4. Incomplete reporting, including non-significant results   0.253  0.23  5.92  < .0001  1.08
5. Not reporting effect sizes                                0.387  0.26  8.25  < .0001  1.51
6. Not adjusting for multiple comparisons                    0.140  0.20  3.87  .00056   0.71
7. Not reporting assumption checks                           0.253  0.25  5.50  < .0001  1.00
value and 2) whether the journal is SSCI-indexed. The remainder of this article offers a
brief overview of the most common violations emerging from the present analysis.
1 Psychometrics

It is important for researchers to report details on the reliability and validity of their
instruments. In our analysis, a number of articles did not report the reliability of their
instruments, particularly dependent variables. In situations where manual coding is
involved, it is also important to examine and report interrater reliability as well. This
is especially important when a large amount of data is coded and subjectivity may be
a concern.

When multiple scales are used (e.g. as part of a questionnaire), it is also important to
examine the factorial structure of the scales, whether using a procedure from classical
test theory or item response theory. It is common for researchers to adapt existing scales
for their own purposes but without conducting a factor analytic procedure to examine
convergent and discriminant validity of these scales. In some cases, even the original
developers of these scales did not investigate these issues. Some may argue that such
scales have been established in previous research. However, it would seem arbitrary to
require reporting reliability every time a scale is administered, but assume that other
psychometric characteristics can simply be imported from prior research. Reliability also
gives a very limited insight into the psychometric properties of a scale. Green et al.
(1977) offer examples of high reliability (e.g. over .80) that is a mere artifact of a long,
multidimensional scale, while Schmitt (1996) showed that low reliability (e.g. under .50)
is not necessarily problematic (see also Sijtsma, 2009, for a more detailed critique).
A discussion of validity is equally important. In our sample, a common situation
where a discussion of validity was lacking was in the use of authentic variables, such as
school grades. When the researcher draws from such authentic variables, it is typically
Table 5. Journals demonstrating highest quality in the present sample and their 2015 citation analysis scores.

Journal                                                           SNIP   SJR   CiteScore   JCR
Computer Assisted Language Learning 1.54 1.26 1.64 1.72
English for Specific Purposes 2.73 1.66 2.11 1.14
International Review of Applied Linguistics in Language Teaching 0.97 0.91 0.95 0.80
Language Assessment Quarterly 0.63 1.07 0.93 0.98
Language Learning 2.54 2.47 2.58 1.87
Language Testing 1.36 1.44 1.50 0.91
Modern Language Journal 1.13 1.15 1.54 1.19
Studies in Second Language Acquisition 1.41 2.49 1.99 2.23
TESOL Quarterly 1.43 1.46 1.58 1.51
Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normal-
ized Impact per Paper; SSCI = Social Sciences Citation Index.
out of the researcher’s hands to control for their reliability and validity. In such cases,
readers would at least need a detailed description of the variable, its characteristics, and
the circumstances surrounding its measurement in order to evaluate its adequacy for the
purpose of the article, such as whether grades are a fair reflection of proficiency in the
context in question. This information may also be helpful in resolving inconsistent results
when they emerge from different studies.
When researchers develop their own instruments, extra work is required. Instrument
development should be treated as a crucial stage in a research project. Researchers devel-
oping a new instrument should perform adequate piloting to improve the psychometric
properties of the instrument before the actual study starts in order to convince the reader
of its outcomes. Poor instruments risk misleading results.
2 Inferential statistics
One of the most common issues arising in our analysis was the tendency to make infer-
ences from descriptive statistics. It is important to be aware of the distinction between
descriptive statistics and inferential statistics. Descriptive statistics refer to the character-
istics of the sample in hand. These characteristics could potentially be idiosyncratic to
this specific sample and not generalizable to the population it was sampled from.
Inferential statistics help decide whether these characteristics are generalizable to the
population, primarily through a trade-off between the magnitude of the descriptive sta-
tistic (e.g. mean difference between two groups) and the size of the sample.
Descriptive statistics alone may be useful in describing some general trends. However,
in most cases, without inferential statistics it may not be clear whether the pattern
observed is genuine or resulting from chance. This applies to all descriptive statistics,
such as means, standard deviations, percentages, frequencies, and counts. Researchers
reporting any of these statistics should consider an appropriate inferential test before
making generalizations to the population. In certain situations, it might at first seem hard
to think of an appropriate inferential test, but it is the researcher’s responsibility to dem-
onstrate to the reader that the results are generalizable to the population. Ideally, the
decision on which test to use should be made at the design stage before conducting the
study (e.g. preregistration).
In our sample, we found three common situations where inferences were frequently
made without inferential statistics. The first was when only one sample was involved. In
such situations, the researcher might consider a one-sample test to tell whether the statis-
tic is significantly different from zero (as was done in the present study). In some cases,
this might be a mundane procedure, but it is a first step toward calculating the size of the
effect (see below), which is typically a more interesting question. The second situation
arising from our analysis had to do with count data. A number of articles reported counts
of certain phenomena (e.g. number of certain linguistic features in a corpus), and then
made inferences based on those counts. In such cases, the researcher might consider the
chi-square test for independent groups and the McNemar test for paired groups. Rayson,
Berridge, and Francis (2004) have also suggested the log likelihood test for comparing
observed corpus frequencies. The third, more subtle, situation was when researchers
compared two test statistics, such as two correlation coefficients. In these situations, the
two coefficients might be different but the question is whether this difference is itself
large enough to be statistically significant. In fact, even if one coefficient were signifi-
cant and the other were not, this would not be sufficient to conclude that the difference
between them would also be significant (see Gelman & Stern, 2006). In this case, Fisher’s
r-to-z transformation could be used.
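As an illustration of the first and third situations, the following sketch (invented numbers, standard library only) computes a one-sample t statistic and Fisher's r-to-z test for the difference between two independent correlations:

```python
import math
from statistics import mean, stdev

def one_sample_t(xs, mu0=0.0):
    """t statistic for H0: the population mean equals mu0."""
    return (mean(xs) - mu0) / (stdev(xs) / math.sqrt(len(xs)))

def fisher_r_to_z(r1, n1, r2, n2):
    """Two-tailed z test of the difference between two independent correlations."""
    diff = math.atanh(r1) - math.atanh(r2)       # r-to-z transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of the difference
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))         # two-tailed p under the standard normal
    return z, p

t = one_sample_t([0.8, 1.2, 0.5, 0.9, 1.1, 0.7, 1.0, 0.6])
z, p = fisher_r_to_z(0.50, 100, 0.30, 120)
```

With these hypothetical figures, r = .50 (n = 100) versus r = .30 (n = 120) gives z ≈ 1.75, p ≈ .08: each coefficient is significant on its own, yet the difference between them is not significant at the .05 level, precisely the point made by Gelman and Stern (2006).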
Another issue arising from our results is incomplete reporting of results, including
non-significant ones. A number of articles presented their results in detail if they were
significant, but the non-significant results were abridged and presented only in passing.
Regardless of the outcome of the significance test, the results should be reported in full,
including descriptive statistics, test statistics, degrees of freedom (where relevant), p-val-
ues, and effect sizes. Failing to report non-significant results can lead to publication bias,
in which only significant results become available to the research community (Rothstein,
Sutton, & Borenstein, 2005). Failing to report non-significant results in full may also
preclude the report from inclusion in future meta-analyses.
In our analysis, we did not consider it a violation if confidence intervals were not
reported. Although there have been recent calls in the L2 field to report confidence inter-
vals to address some limitations of significance testing (e.g. Lindstromberg, 2016; Nassaji,
2012; Norris, 2015), this issue is less straightforward than it might seem at first. This is due
to the controversy surrounding the interpretation of confidence intervals. Since they were
first introduced (Neyman, 1937), confidence intervals have never been intended to repre-
sent either the uncertainty around the result, its precision, or its likely values. Morey,
Hoekstra, Rouder, Lee, and Wagenmakers (2016) refer to such interpretations as the fallacy
of placing confidence in confidence intervals, which is prevalent among students and
researchers alike (Hoekstra, Morey, Rouder, & Wagenmakers, 2014). As a matter of fact,
confidence intervals of a parameter refer to the interval that, in repeated sampling, has on
average a fixed (e.g. 95%) probability of containing that parameter. Confidence intervals
therefore concern the probability in the long run, and may not be related to the results from
the original study. Some statisticians have even gone as far as to describe confidence inter-
vals as ‘scientifically useless’ (Bolstad, 2007, p. 228). Whether the reader would agree with
this evaluation or consider it rather extreme, our aim is to point out that confidence inter-
vals are far from the unanimously accepted panacea for significance testing ills.
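The long-run character of confidence intervals can be made concrete with a small simulation (a sketch under assumed population values, not tied to any dataset in this review): across repeated samples, roughly 95% of the computed intervals contain the true mean, but nothing is thereby established about any single interval.

```python
import math
import random

random.seed(42)
TRUE_MEAN, TRUE_SD, N, REPS = 100.0, 15.0, 100, 2000
CRIT = 1.984  # approximate two-tailed t critical value for df = 99

hits = 0
for _ in range(REPS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = sum(sample) / N
    sd = math.sqrt(sum((x - m) ** 2 for x in sample) / (N - 1))
    half = CRIT * sd / math.sqrt(N)          # half-width of the 95% CI
    if m - half <= TRUE_MEAN <= m + half:    # did this interval capture the parameter?
        hits += 1

coverage = hits / REPS  # close to .95 in the long run
```

The ~95% figure is a property of the procedure over replications, which is exactly the distinction Morey et al. (2016) argue is routinely misread as a statement about one particular interval.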
In contrast to confidence intervals, there is far more agreement among methodologists
on the need for reporting effect sizes to complement significance testing. The discussion
so far has mentioned significance and significance testing several times, probably giving
the impression that this procedure is highly meaningful. Actually, a significant result is a
relatively trivial outcome. Because the null hypothesis is always false (Cohen, 1994),
just increase your sample size if you want a significant result! A p-value is the probability
of obtaining data at least as extreme as those observed, assuming the null hypothesis is true. When
the result is significant at the .05 level, we can conclude that the probability of obtaining such an
outcome by chance would be less than 5% if the null hypothesis were true. It does not mean that the effect is big or strong. At the same time,
a non-significant result does not represent evidence against the hypothesis, as it is pos-
sible that the study was underpowered. To obtain evidence against a hypothesis, the
researcher needs to calculate the power of the test and then the effect size to demonstrate
that there is no ‘non-negligible’ effect (see Cohen et al., 2003; Lakens, 2013), or alterna-
tively use Bayes factors (see Dienes, 2014).
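As a minimal sketch of complementing a test with an effect size (group scores invented for illustration), Cohen's d for two independent groups can be computed and then interpreted against field-specific benchmarks such as those of Plonsky and Oswald (2014):

```python
import math
from statistics import mean, stdev

def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

# Hypothetical treatment and control scores
d = cohens_d([78, 82, 80, 85, 90], [70, 75, 72, 78, 80])  # ≈ 1.81
```

Unlike the p-value, d is unaffected by sample size, which is why reporting it alongside the test statistic answers the more interesting question of how large the effect actually is.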
The final point discussed in this section is not adjusting for multiple comparisons. As
mentioned above, a p-value gives the probability of obtaining the data if the null is true.
It therefore does not refer to the probability of the hypothesis itself. With multiple com-
parisons, the likelihood of obtaining a significant result by mere chance no longer
remains at 5%, thus raising the risk of Type I error. One way to address this problem is
to implement an appropriate correction procedure (Larson-Hall, 2016; Ludbrook, 1998).
Another approach is to determine the specific tests to be conducted beforehand. Any
other analyses conducted should then be labeled explicitly as exploratory, since their
results could potentially reflect Type I error. Perhaps the worst a researcher could do in
this regard is to conduct various tests and report only those that turn out significant.
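One such correction procedure is Holm's step-down method, sketched below with made-up p-values (an illustration of the technique, not a reanalysis of any reviewed article):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return which hypotheses are rejected after Holm's step-down correction."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):  # compare to alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return reject

# Four hypothetical p-values from one family of tests
decisions = holm_bonferroni([0.01, 0.04, 0.03, 0.005])
```

Note that .03 and .04 would both be 'significant' uncorrected, but neither survives the family-wise correction, which is the point of the procedure.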
3 Other issues
In our analysis, a number of issues emerged related to not reporting assumption checks
before conducting particular procedures. Assumption checking was placed in a separate category because
it is used here in a broad sense, covering both descriptive and inferential statistics as
well as psychometrics. For example, some articles used nonparametric tests due to non-normality
of the data but also reported the mean and standard deviation to describe their
data. Using the mean and standard deviation assumes that the data are normal. Many
articles also used inferential tests that require certain assumptions, such as normality and
linearity, but without assuring the reader that these assumptions were satisfied. Other
articles that performed factor analysis did not report factor loadings fully or address the
implications of cross-loadings. In many cases, such concerns can be addressed by simply
making the dataset publicly available.
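A quick screen for the normality assumption can be as simple as inspecting sample skewness (a sketch with invented data; in practice one would also examine plots and formal tests):

```python
from statistics import mean, stdev

def skewness(xs):
    """Adjusted sample skewness; values far from 0 flag asymmetry (non-normality)."""
    n, m, s = len(xs), mean(xs), stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

symmetric = skewness([1, 2, 3, 4, 5])         # ≈ 0: no asymmetry
right_skewed = skewness([1, 1, 2, 2, 3, 10])  # clearly positive: long right tail
```

Reporting such a check takes one line in a manuscript but assures the reader that the mean and standard deviation, or a parametric test, are appropriate summaries of the data.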
A particularly overlooked assumption is that of uncorrelated errors. Many statistical procedures
assume that errors are uncorrelated. For example, when learners come from distinct
classes, learners from the same class will tend to be more similar to each other than to those from
different classes. When this happens, the observations are no longer independent. As a consequence,
Type I error rate increases, such as when learners from one class in the group have higher
scores because of some unique feature of that class. In this case, the overall group mean
will be inflated because of one class only. The effect of violating this independence
assumption might be mild when there are only a few classes. But with more classes (e.g.
over 20), the effect could be more serious. One approach to deal with this situation is to
use classes as the unit of analysis by averaging the values for learners within each class.
Another approach is to use multilevel and mixed-effects modeling to model both higher
and lower units simultaneously (Hox, 2010).
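The first approach, taking the class as the unit of analysis, can be sketched in a few lines (the class labels and scores below are invented):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical learner scores nested within classes
scores = [("classA", 78), ("classA", 82), ("classA", 80),
          ("classB", 65), ("classB", 70),
          ("classC", 90), ("classC", 88), ("classC", 92)]

by_class = defaultdict(list)
for cls, score in scores:
    by_class[cls].append(score)

# One value per class: the class, not the learner, becomes the unit of analysis
class_means = {cls: mean(vals) for cls, vals in by_class.items()}
```

This aggregation sacrifices within-class information, which is why multilevel modeling (Hox, 2010) is generally preferable when enough classes are available.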
Conclusion
This article has presented a review of 30 journals representative of the L2 field. The review
focused on statistical issues specifically, rather than methodological issues. This was not
intended to downplay the importance of methodological issues, such as an adequate sample
size based on a priori power calculation or including a control group in (quasi)experimental
designs. Instead, we focused on statistical issues only because they seemed relevant to a
broader section of the field, including areas fraught with practical constraints.
The results showed that Scopus’s citation analysis metrics function as moderate pre-
dictors of L2 journal quality, accounting for around 11–17% of the observed variance in
journals’ statistical quality, thus providing no evidence in favor of the newly introduced
CiteScore over the other metrics, at least in our field. Another indicator of journal quality
is whether the journal is SSCI-indexed. SSCI’s JCR was not a significant predictor of
journal quality, probably due to the small variation among SSCI-indexed journals, most
of which show high quality in the L2 field. The analysis also revealed a number of
prevalent statistical violations that were surveyed in this article. Future research should
investigate other aspects of journal quality (i.e. other than statistical) to examine their
relationship with journal indexing and citation analysis metrics.
The present study is not without limitations. Our sample of 30 journals was rather
small. However, we were limited by the available number of L2 journals that are indexed
by Scopus and SSCI. For this reason, we did not have the luxury of conducting a power
analysis and then obtain a sufficiently large sample. Nevertheless, our study is still one
of the largest quantitative surveys of L2 journals in the field to date. A further limitation
is whether selecting five articles from a journal would be truly representative of that
journal. In our case, in addition to aiming to investigate the most recent quantitative
trends in journals, we were also bound by practical constraints. A total of 150 journal
articles to read and analyse is no easy feat. Rather than insisting that future researchers
use a larger sample than ours, an alternative approach is to conduct comparable
studies on more recent literature and then combine the results meta-analytically. This
would help build a cumulative science of journal quality in the field.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Note
1. We do not claim that other journals not listed in Table 5 necessarily have lower quality
because our sample was not exhaustive of journals in the field and because it included only
five recent articles from each journal. In fact, even for journals listed in Table 5, we do not
recommend that researchers interested in improving their statistical literacy browse older issues
of these journals, since quality (and editorial policies) change over time.
ORCID iDs
Ali H. Al-Hoorie https://orcid.org/0000-0003-3810-5978
Joseph P. Vitta https://orcid.org/0000-0002-5711-969X
References
Benson, P., Chik, A., Gao, X., Huang, J., & Wang, W. (2009). Qualitative research in language
teaching and learning journals, 1997–2006. The Modern Language Journal, 93, 79–90.
Bolstad, W.M. (2007). Introduction to Bayesian statistics. 2nd edition. Hoboken, NJ: Wiley.
Brown, J.D. (1990). The use of multiple t tests in language research. TESOL Quarterly, 24, 770–773.
Brown, J.D. (2004). Resources on quantitative/statistical research for applied linguists. Second
Language Research, 20, 372–393.
Brumback, R.A. (2009). Impact factor wars: Episode V: The empire strikes back. Journal of Child
Neurology, 24, 260–262.
Chadegani, A.A., Salehi, H., Yunus, M.M., et al. (2013). A comparison between two main aca-
demic literature collections: Web of Science and Scopus databases. Asian Social Science, 9,
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cohen, J., Cohen, J., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences. 3rd edition. Mahwah, NJ: Lawrence Erlbaum.
Colledge, L., de Moya-Anegón, F., Guerrero-Bote, V., et al. (2010). SJR and SNIP: Two new
journal metrics in Elsevier’s Scopus. Serials, 23, 215–221.
da Silva, J.A.T., & Memon, A.R. (2017). CiteScore: A cite for sore eyes, or a valuable, transparent
metric? Scientometrics, 111, 553–556.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in
Psychology, 5. Available online at http://doi.org/10.3389/fpsyg.2014.00781 (accessed March 2018).
Egbert, J.O.Y. (2007). Quality analysis of journals in TESOL and applied linguistics. TESOL
Quarterly, 41, 157–171.
Garfield, E. (2006). The history and meaning of the journal impact factor. JAMA, 295, 90–93.
Gelman, A., & Stern, H.S. (2006). The difference between ‘significant’ and ‘not significant’ is not
itself statistically significant. The American Statistician, 60, 328–331.
Green, S.B., Lissitz, R.W., & Mulaik, S.A. (1977). Limitations of coefficient alpha as an index of
test unidimensionality. Educational and Psychological Measurement, 37, 827–838.
Guerrero-Bote, V.P., & Moya-Anegón, F. (2012). A further step forward in measuring journals’
scientific prestige: The SJR2 indicator. Journal of Informetrics, 6, 674–688.
Guz, A.N., & Rushchitsky, J.J. (2009). Scopus: A system for the evaluation of scientific journals.
International Applied Mechanics, 45, 351–362.
Hoekstra, R., Morey, R.D., Rouder, J.N., & Wagenmakers, E.-J. (2014). Robust misinterpretation
of confidence intervals. Psychonomic Bulletin & Review, 21, 1157–1164.
Hong-Nam, K., & Leavell, A.G. (2006). Language learning strategy use of ESL students in an
intensive English learning context. System, 34, 399–415.
Hox, J.J. (2010). Multilevel analysis: Techniques and applications. 2nd edition. New York:
Jung, U.O.H. (2004). Paris in London revisited or the foreign language teacher’s top-most jour-
nals. System, 32, 357–361.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practi-
cal primer for t-tests and ANOVAs. Frontiers in Psychology, 4. Available online at http://doi.
org/10.3389/fpsyg.2013.00863 (accessed March 2018).
Larson-Hall, J. (2016). A guide to doing statistics in second language research using SPSS and R.
2nd edition. New York: Routledge.
Larson-Hall, J. (2017). Moving beyond the bar plot and the line graph to create informative and
attractive graphics. The Modern Language Journal, 101, 244–270.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research findings:
What gets reported and recommendations for the field. Language Learning, 65, 127–159.
Leydesdorff, L., & Opthof, T. (2010). Scopus’s source normalized impact per paper (SNIP) versus
a journal impact factor based on fractional counting of citations. Journal of the American
Society for Information Science and Technology, 61, 2365–2369.
Lindstromberg, S. (2016). Inferential statistics in Language Teaching Research: A review and
ways forward. Language Teaching Research, 20, 741–768.
Loewen, S., & Gass, S. (2009). The use of statistics in L2 acquisition research. Language Teaching,
Loewen, S., Crowther, D., Isbell, D., Lim, J., Maloney, J., & Tigchelaar, M. (2017). The statisti-
cal literacy of applied linguistics researchers. Unpublished paper presented at the American
Association for Applied Linguistics (AAAL), Portland, Oregon, USA.
Loewen, S., Lavolette, E., Spino, L.A., et al. (2014). Statistical literacy among applied linguists
and second language acquisition researchers. TESOL Quarterly, 48, 360–388.
Ludbrook, J. (1998). Multiple comparison procedures updated. Clinical and Experimental
Pharmacology and Physiology, 25, 1032–1037.
Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., & Wagenmakers, E.-J. (2016). The fallacy
of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23, 103–123.
Nassaji, H. (2012). Statistical significance tests and result generalizability: Issues, misconceptions,
and a case for replication. In G.K. Porte (Ed.), Replication research in applied linguistics (pp.
92–115). Cambridge: Cambridge University Press.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of prob-
ability. Philosophical Transactions of the Royal Society of London, Series A: Mathematical
and Physical Sciences, 236, 333–380.
Norris, J.M. (2015). Statistical significance testing in second language research: Basic problems
and suggestions for reform. Language Learning, 65, 97–126.
Norris, J.M., Plonsky, L., Ross, S.J., & Schoonen, R. (2015). Guidelines for reporting quantitative
methods and results in primary research. Language Learning, 65, 470–476.
Oxford, R.L. (1990). Language learning strategies: What every teacher should know. New York:
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting prac-
tices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodological
synthesis and call for reform. The Modern Language Journal, 98, 450–470.
Plonsky, L. (Ed.) (2015). Advancing quantitative methods in second language research. New
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The
case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Oswald, F.L. (2014). How big is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Radwan, A.A. (2011). Effects of L2 proficiency and gender on choice of language learning strate-
gies by university students majoring in English. The Asian EFL Journal, 13, 115–163.
Rayson, P., Berridge, D., & Francis, B. (2004). Extending the Cochran rule for the comparison
of word frequencies between corpora. Unpublished paper presented at the 7th International
Conference on Statistical analysis of textual data (JADT 2004), Louvain-la-Neuve, Belgium.
Rothstein, H.R., Sutton, A.J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-analysis:
Prevention, assessment and adjustments. Chichester: Wiley.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha.
Psychometrika, 74, 107–120.
Stevens, J. (2009). Applied multivariate statistics for the social sciences. 5th edition. New York:
Tabachnick, B.G., & Fidell, L.S. (2013). Using multivariate statistics. 6th edition. Boston, MA:
Tseng, W.-T., Dörnyei, Z., & Schmitt, N. (2006). A new approach to assessing strategic learning:
The case of self-regulation in vocabulary acquisition. Applied Linguistics, 27, 78–102.
Vitta, J.P., & Al-Hoorie, A.H. (2017). Scopus- and SSCI-indexed L2 journals: A list for the Asia
TEFL community. The Journal of Asia TEFL, 14, 784–792.
Weiner, G. (2001). The academic journal: Has it a future? Education Policy Analysis Archives, 9.
Available online at http://doi.org/10.14507/epaa.v9n9.2001 (accessed March 2018).
Ali H. Al-Hoorie is assistant professor at the English Language Institute, Jubail Industrial College,
Saudi Arabia. He completed his PhD degree at the University of Nottingham, UK, under the super-
vision of Professors Zoltán Dörnyei and Norbert Schmitt. He also holds an MA in Social Science
Data Analysis from Essex University, UK. His research interests include motivation theory,
research methodology, and complexity.
Joseph P. Vitta is active in TESOL/Applied Linguistics research with interests and publications in
lexis, curriculum design, research methods, and computer-assisted language learning. As an ELT
professional, he has over 12 years’ experience as both a program manager and language teacher.
CALL, computer assisted language learning, EAP, English for academic purposes, EFL, English
as a foreign language, ELL, English language learner, ELT, English language teaching, ESP,
English for specific purposes, FLA, Foreign language acquisition, foreign language, language
acquisition, language assessment, language testing, language classroom, language curriculum, lan-
guage education, language educator, language learning, language learner, language learners, lan-
guage proficiency, language teaching, language teacher, language teachers, second language,
SLA, second language acquisition, TEFL, teaching English as a foreign language, TESL, teaching
English as a second language, TESOL, teaching English to speakers of other languages, teaching
1. Applied Linguistics
2. Asian EFL Journal
3. Asian ESP Journal
5. Computer Assisted Language Learning
6. Electronic Journal of Foreign Language Teaching
7. ELT Journal
8. English for Specific Purposes
9. Foreign Language Annals
10. Indonesian Journal of AL
11. Innovation in Language Learning and Teaching
12. International Review of Applied Linguistics in Language Teaching
13. Iranian Journal of Language Teaching Research
14. Journal of Asia TEFL
15. Journal of English for Academic Purposes
16. Journal of Second Language Writing
17. Language Assessment Quarterly
18. Language Learning
19. Language Learning & Technology
20. Language Learning Journal
21. Language Teaching Research
22. Language Testing
23. Modern Language Journal
25. Second Language Research
26. Studies in Second Language Acquisition
28. Teaching English with Technology
29. TESOL Quarterly
30. JALT CALL Journal