Language Teaching Research, 2019, Vol. 23(6), 727–744
© The Author(s) 2018
DOI: 10.1177/1362168818767191
journals.sagepub.com/home/ltr
The seven sins of L2 research:
A review of 30 journals’
statistical quality and their
CiteScore, SJR, SNIP, JCR
Impact Factors
Ali H. Al-Hoorie
Jubail Industrial College, Saudi Arabia

Joseph P. Vitta
Queen’s University Belfast, UK; Rikkyo University – College of Intercultural Communication, Japan

Corresponding author: Joseph P. Vitta, Queen’s University Belfast, University Road, Belfast BT7 1NN, UK. Email: jvitta01@qub.ac.uk
Abstract
This report presents a review of the statistical practices of 30 journals representative of the
second language field. A review of 150 articles showed a number of prevalent statistical violations
including incomplete reporting of reliability, validity, non-significant results, effect sizes, and
assumption checks as well as making inferences from descriptive statistics and failing to correct
for multiple comparisons. Scopus citation analysis metrics and whether a journal is SSCI-indexed
were predictors of journal statistical quality. No clear evidence was obtained to favor the newly
introduced CiteScore over SNIP or SJR. Implications of the results are discussed.
Keywords
citation analysis metrics, CiteScore, JCR Impact Factor, journal quality, quantitative research,
second language, SJR, SNIP
I Introduction
Second language (L2) researchers have long been interested in improving the quantita-
tive rigor in the field. In an early study, Brown (1990) pointed out the need to improve
quantitative quality in the field, singling out the importance of using ANOVA as opposed
to multiple t-tests. This recommendation may now be common knowledge to many
researchers, pointing to the fact that our field has made substantial progress in statistical
practices over the decades (see Plonsky, 2014). The field has now moved to relatively
more advanced topics, including the need for a priori power calculation and assumption
checking (e.g. Larson-Hall, 2016; Norris et al., 2015; Plonsky, 2015). The analysis of
quantitative practices in the field is currently an active area of research, as ‘there is no
controversy over the necessity of rigorous quantitative methods to advance the field of
SLA’ (Plonsky, 2013, p. 656).
In addition to the goal of improving quantitative quality, there has also been an inter-
est in overall journal quality, both within the L2 field and in academia in general (see
Egbert, 2007; Garfield, 2006; Vitta & Al-Hoorie, 2017). Indexing has often been seen as
a key factor in journal evaluation, with Scopus and the Web of Science being the two
indexes of prestige (Chadegani et al., 2013; Guz & Rushchitsky, 2009). Within these two
catalogues, citation analysis metrics have been employed as efficient measurements of
overall journal quality for some time. Garfield (2006), for example, noted that the Web
of Science’s Impact Factor has been in use since the 1950s as the index’s citation analysis
metric. At the same time, indexing and citation analysis have not universally been
accepted as a definitive way to assess journal quality within a field (e.g. Brumback,
2009; Egbert, 2007). This is primarily due to the notion that citation quantifies only one
aspect in the overall evaluation of a journal and might miss other, perhaps equally impor-
tant, aspects of journal quality.
Based on similar considerations, Plonsky (2013) has suggested the need for investi-
gating the relationship between statistical practices and journal quality in the L2 field. To
date, such an investigation does not seem to have been performed. The current study
therefore aimed to address this gap. A total of 150 quantitative articles from 30 L2 jour-
nals were assessed for quantitative rigor. The relationship between each journal’s quan-
titative quality and a number of popular journal quality measurements relating to citation
analysis and indexing were then examined.
II Overview
1 Quantitative rigor
In recognition of the importance of quantitative knowledge in the second language (L2)
field, a number of researchers have recently investigated various aspects related to meth-
odological and statistical competence of L2 researchers. Loewen and colleagues (2014),
for instance, found that only 30% of professors in the field report satisfaction with their
level of statistical training, while only 14% of doctoral students do so. Loewen and col-
leagues (2017) subsequently extended this line of research by using independent meas-
ures of quantitative competence, rather than simply relying on self-reported knowledge,
and also found significant gaps in the statistical literacy of L2 researchers. Furthermore,
quality of research design does not seem a priority for some scholars in the field when
evaluating the status of different journals (Egbert, 2007).
Inadequate quantitative knowledge is likely to be evident in the field’s publica-
tions. In one of the first empirical assessments of the methodological quality in published
L2 research, Plonsky and Gass (2011) investigated quantitative reports in the L2 interac-
tion tradition spanning a period of around 30 years. They observed that ‘weaknesses in
the aggregate findings appear to outnumber the strengths’ (p. 349) and speculated
whether this is in part due to inadequate training of researchers. Subsequently, Plonsky
(2013, 2014) investigated articles published in two major journals in the field (Language
Learning and Studies in Second Language Acquisition). In line with previous findings,
these studies identified a number of limitations prevalent in the field, such as lack of
control in experimental designs and incomplete reporting of results. In a more recent
study, Lindstromberg (2016) examined articles published in one journal (Language
Teaching Research) over a period of about 20 years. This study also found issues similar
to those obtained in previous studies, such as incomplete reporting and overemphasizing
significance testing over effect sizes.
2 Unpacking the ‘sins’ of quantitative research
A large number of topics fall under quantitative research, making a parsimonious clas-
sification of these topics no easy task (see Stevens, 2009; Tabachnick & Fidell, 2013). In
this article, we adopt an approach similar to that used by Brown (2004), where statistical
topics are classified into three broad areas: psychometrics, inferential testing, and
assumption checking.
Psychometrics – subsuming reliability and validity – has been of major concern to
past research into the quantitative rigor in the field. Larson-Hall and Plonsky (2015)
pointed out that the most commonly used reliability coefficients in the field are
Cronbach’s α (for internal consistency) and Cohen’s κ (interrater reliability). Norris
et al. (2015) echoed the call for researchers to address reliability while also arguing
that researchers provide ‘validity evidence’ (p. 472). Validity evidence is of equal
importance to reliability, though validity is trickier since it lacks a straightforward
numeric value that researchers can report. For this reason, Norris et al. (2015) sug-
gested the use of conceptual validity evidence, such as pilot studies and linking instru-
ments to past research. Empirical validity evidence can also be utilized, such as factor
analysis for construct validity (Tabachnick & Fidell, 2013). In L2 research, reliability
has usually been emphasized, while validity considerations have been somewhat over-
looked. For example, studies using the Strategy Inventory for Language Learning
(Oxford, 1990) often reported internal reliability (see Hong-Nam & Leavell, 2006;
Radwan, 2011), despite calls for the need to also consider validity evidence for this
instrument (Tseng, Dörnyei, & Schmitt, 2006).
When it comes to inferential testing, a number of issues have been pointed out in
previous methodological research. One of these issues is, obviously, the need to use
inferential statistics. Descriptive statistics alone can sometimes be informative (Larson-
Hall, 2017), but in most cases they cannot show whether an observed trend is merely a
natural fluctuation that should not be overinterpreted. Inferential statistics, by definition,
permit the researcher to generalize the observed trend from the sample at hand to the
population. In their investigation of interaction research, Plonsky and Gass (2011) report
that around 6% of the studies did not perform any inferential tests. Similarly, Plonsky (2013)
reports in a subsequent study that 12% of the studies in the sample did not use inferential testing.
A second issue concerning inferential testing has to do with complete reporting of the
results (Larson-Hall & Plonsky, 2015). A number of methodologists (e.g. Brown, 2004;
Nassaji, 2012; Norris et al., 2015; Plonsky, 2013; 2014, among others) have emphasized
that inferential tests must have all their relevant information presented for transparency
and replicability. In the case of t-tests, for example, readers would need information
regarding means, standard deviations, degrees of freedom, t-value, and p-value.
Effect sizes, which describe the magnitude of the relationship among variables, are a
further required aspect of quantitative rigor. In arguing for the centrality of effect sizes,
Norris (2015) asserted that L2 researchers have a tendency to conflate statistical signifi-
cance with practical significance. In the same vein, Nassaji (2012) posited that our field
has yet to firmly grasp that p-values only speak to Type I error probability and not the
strength of association between dependent and independent variables (see Cohen, Cohen,
West, & Aiken, 2003). Effect size reporting has therefore been stressed by L2 method-
ologists in recent years (e.g. Larson-Hall & Plonsky, 2015; Plonsky, 2013, 2014; Plonsky
& Gass, 2011; Plonsky & Oswald, 2014).
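To make the recommended reporting practice concrete, the following sketch pairs an independent-samples t-test with Cohen’s d computed from the pooled standard deviation; the scores are hypothetical and the snippet is only an illustration, not an analysis from our sample.

import numpy as np
from scipy import stats

def cohens_d(a, b):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical post-test scores for two groups
treatment = [78, 85, 90, 72, 88, 80, 84]
control = [70, 75, 80, 68, 74, 79, 73]

t, p = stats.ttest_ind(treatment, control)
print(f"t = {t:.2f}, p = {p:.3f}, d = {cohens_d(treatment, control):.2f}")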
In addition to the above, a common situation in inferential testing is when researchers
perform several tests. Brown (1990) suggests that researchers should employ ANOVA as
an alternative to multiple t-tests in order to control the Type I error rate. Norris and colleagues
(Norris, 2015; Norris et al., 2015) also highlight the need to correct the alpha
level for multiple comparisons. A common procedure is the Bonferroni correction, where the
alpha level is divided by the number of tests performed. As an illustration, if 10 t-tests are
performed simultaneously in one study, the alpha level becomes .05/10 = .005. In this
example, a result is considered significant only if the p-value is less than .005. Procedures
that are less conservative than the Bonferroni correction have also been proposed (e.g.
Holm–Bonferroni and Benjamini–Hochberg; see Larson-Hall, 2016; Ludbrook, 1998).
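The arithmetic of these corrections is simple enough to sketch directly; the functions and p-values below are hypothetical illustrations of the Bonferroni and Holm–Bonferroni procedures, not part of the original analysis.

def bonferroni(p_values, alpha=0.05):
    """Flag each p-value as significant under the Bonferroni correction."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure (less conservative than Bonferroni)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    significant = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] < alpha / (m - rank):
            significant[i] = True
        else:
            break  # once one test fails, all larger p-values are non-significant
    return significant

# Ten hypothetical p-values from ten simultaneous t-tests
ps = [0.001, 0.004, 0.012, 0.020, 0.030, 0.041, 0.048, 0.049, 0.200, 0.600]
print(bonferroni(ps))  # only p-values below .05/10 = .005 survive
print(holm(ps))        # step-down thresholds: .005, .0056, .00625, ...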
Finally, an essential consideration in quantitative research is checking that the necessary
assumptions are satisfied. Although it is typically classified under inferential statistics (i.e.
to determine whether parametric tests are appropriate), this point is placed in a separate
category in the present article for two reasons. First, checking assumptions is not limited to
inferential testing but also applies to descriptive statistics. For example, reporting the mean
and standard deviation assumes that the data are normally distributed. Otherwise, the mean
and standard deviation would not be representative of the central tendency and dispersion
of the data, respectively. Psychometrics also have assumptions that need to be satisfied. For
example, Cronbach’s alpha assumes that the construct is unidimensional (Green, Lissitz, &
Mulaik, 1977), or else its value could be inflated. Second, assumption checking seems
consistently overlooked in L2 research, despite repeated calls emphasizing its importance
(e.g. Lindstromberg, 2016; Loewen & Gass, 2009; Nassaji, 2012; Norris, 2015; Norris
et al., 2015; Plonsky, 2014; Plonsky & Gass, 2011). In the present study, the violations
reviewed above are called the seven ‘sins’ of quantitative research (see Table 1).
III Journal quality
Discussion of the methodological rigor of research articles ultimately speaks to the qual-
ity of the field’s journals. Assessment of journal quality is of particular importance
because of the value different stakeholders in both academic and professional arenas
place on research found in them (Weiner, 2001). Egbert (2007) surveyed multiple ways
to gauge journal quality in the L2 field, such as citation analysis, rejection rate, time to
publication, and expert opinion.
In reality, however, citation analysis has been one of the most common means of
evaluating different journals, probably because it offers a simple numeric impact value
to rank each journal (see Brumback, 2009; Colledge et al., 2010; Leydesdorff & Opthof,
2010). The history of citation analysis metrics dates back to the 1950s (see Garfield,
2006). Academia, in general, has been eager to embrace a means to gauge journal quality
empirically (Weiner, 2001). Currently, the most commonly used citation analysis metrics
are Source Normalized Impact per Paper (SNIP), SCImago Journal Rank (SJR), and
Journal Citation Reports (JCR) Impact Factor. The former two are maintained by
Elsevier’s Scopus, while JCR is maintained by Clarivate’s Web of Science (WOS; formerly
Thomson Reuters). WOS also maintains the Social Sciences Citation Index
(SSCI), which is the most relevant to our field.
Intense competition exists between these two indexing services, resulting in continuous
improvement of their metrics (see Chadegani et al., 2013; Guz & Rushchitsky, 2009). As
part of this development, Scopus has recently unveiled a new metric called CiteScore (see
da Silva & Memon, 2017), which is calculated in a similar way to JCR except that the look-
back period is three years rather than two. Table 2 presents an overview of these metrics.
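Following the descriptions summarized in Table 2, the two ratio-based metrics can be written schematically as follows; this is only an approximation, since the indexing services apply additional rules about which document types count as citable.

\[
\mathrm{JCR\ IF}_{y} = \frac{C_{y}(P_{y-1} \cup P_{y-2})}{|P_{y-1}| + |P_{y-2}|},
\qquad
\mathrm{CiteScore}_{y} = \frac{C_{y}(P_{y-1} \cup P_{y-2} \cup P_{y-3})}{|P_{y-1}| + |P_{y-2}| + |P_{y-3}|},
\]
where \(P_{y-k}\) denotes a journal’s citable items published in year \(y-k\) and \(C_{y}(\cdot)\) counts the citations those items receive in year \(y\).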
Metrics judging journal quality through citation analysis have come under heavy criti-
cism (Brumback, 2009). Some have expressed doubt about the viability of using one
metric to assess the various dimensions contributing to a journal’s quality (Egbert, 2007).
Some of these metrics are proprietary and their actual calculations are not made public,
making them unverifiable. In fact, there have been reports of editors and publishers
negotiating and manipulating these metrics to improve journal rankings (see Brumback,
2009). Nevertheless, citation analysis metrics remain the primary means of assessing
journal quality, thus governing employment, tenure, funding, and livelihood for many
researchers around the world.
In the L2 field, there have been attempts to evaluate the different journals available.
In one study, Jung (2004) aimed to rank the prestige of a number of journals (n = 12). Jung’s
primary criteria were being teacher-oriented and being indexed by the prestigious Centre
for Information on Language Teaching and Research, then hosted in the journal Language
Teaching. In another study, Egbert (2007) created a list of field-specific journals
primarily based on expert opinion (n = 35). Benson, Chik, Gao, Huang, and Wang (2009)
also employed a list of journals (n = 10) for the purpose of evaluating these journals in
relation to qualitative rigor. In none of these studies, however, were journals’ citation
analysis values systematically taken into consideration.

Table 1. The seven sins of quantitative research.

Psychometrics
  1. Not reporting reliability
  2. Not discussing validity
Inferential statistics
  3. Making inferences from descriptive statistics
  4. Incomplete reporting, including non-significant results
  5. Not reporting effect sizes
  6. Not adjusting for multiple comparisons
Other
  7. Not reporting assumption checks
IV The present study
As reviewed above, previous research on quantitative quality in the L2 field has tended
to focus on specific publication venues (e.g. one or two particular journals) or specific
research traditions (e.g. L2 interaction). Studies with a wider scope, on the other hand,
have not focused on either quantitative quality or its relation to popular journal citation
analysis. In this study, we aimed to conduct a broader review covering a representative
sample of L2 journals. This would allow us to obtain a more general overview of study
quality in the L2 field, as well as to empirically assess the utility of the four citation
analysis metrics as a representation of quality of different journals.
Because of this wide coverage, we also narrowed our scope to focus specifically on
statistical issues rather than more general methodological issues. For example, prior
research has repeatedly shown that many L2 researchers usually overlook important con-
siderations such as power analysis and using a control group (e.g. Lindstromberg, 2016;
Plonsky, 2013, 2014). However, it must be noted that such considerations are design
issues that need to be addressed before conducting the study. Thus, having an adequate
sample size or a control group can sometimes be governed by practical and logistical
considerations that are outside the researcher’s control. Because of the wide coverage of
our sample of journals, involving various L2 sub-disciplines where practical limitations
may be an essential part of everyday research, we limited our review to data analysis
issues specifically.

Table 2. Four common journal citation analysis metrics and their characteristics.

SNIP (Scopus): Number of citations to a journal’s articles in the past three years divided by the total number of its articles in the past three years. Normalized to facilitate cross-discipline comparisons (Colledge et al., 2010).
SJR (Scopus): Essentially the SNIP calculation, additionally weighted by the rank of the citing journal and excluding self-citations. The weighting uses the PageRank algorithm (Guerrero-Bote & Moya-Anegón, 2012).
CiteScore (Scopus): Total number of a journal’s citations in a given year divided by the journal’s total number of citable publications during the past three years (da Silva & Memon, 2017).
JCR (WOS): Total number of a journal’s citations in a given year divided by the journal’s total number of citable publications during the past two years (Garfield, 2006).
Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; WOS = Web of Science.

We aimed to answer three research questions:
1. What are the most common statistical violations found in L2 journals?
2. What is the relationship between the journal’s statistical quality and its citation
analysis scores (SNIP, SJR, CiteScore, and JCR)?
3. What is the relationship between the journal’s statistical quality and its indexing
(SSCI vs. Scopus)?
V Method
1 Inclusion criteria
In order to be included in this study, the journal had to satisfy the following criteria:
1. The journal is indexed by Scopus or SSCI.
2. The journal is within the second/foreign language learning and teaching area.
3. The journal presents original, quantitative research.
4. The journal uses English as its primary medium.
2 Journal search
The titles of journals indexed by Scopus and SSCI were examined against a list of
keywords representing various interests in the L2 field. Three experts were consulted
over two iterations to develop and validate the list of keywords (for the complete list,
see Appendix 1). A few well-known journals were not captured by our keywords (e.g.
System) and these were subsequently added. The final list of journals satisfying all
inclusion criteria included 30 journals (for the complete list, see Appendix 2). All jour-
nals were indexed in Scopus but only 19 were additionally indexed in SSCI. Our sam-
ple of 30 journals was larger than previous samples by Jung (2004, n = 12) and by
Benson et al. (2009, n = 10). It was slightly smaller than that by Egbert (2007, n = 35),
but this was primarily because our sample was limited to journals indexed in either
Scopus or SSCI. Therefore, it seems reasonable to argue that our sample is representa-
tive of journals in the L2 field.
Table 3 presents an overview of the journals in our sample and their citation analysis
scores. Both the Kolmogorov–Smirnov and Shapiro–Wilk tests showed no significant
departure from normality for any of the four impact factors, ps > .05.
3 Data analysis
Five recent quantitative articles were randomly extracted from each of the 30 journals,
resulting in a total of 150 articles. All articles were published in 2016 or later, thus mak-
ing them representative of the latest quantitative trends in these journals. While a five-
article sample may not be fully representative of trends in a journal over time, our aim
was to investigate quantitative practices found in the most recent publications in the L2
field. This would allow us to find out the most common statistical violations in recent
literature. We discuss this issue further in the limitations of this study.
Each article was subsequently reviewed by two researchers independently against the
violations of statistical conventions described in Table 1. Controversial topics were
avoided, such as the adequacy of classical test theory vs. item response theory, or explor-
atory factor analysis vs. principal components analysis. In coding violations in each arti-
cle, repeated violations of the same issue were coded as one violation only, so violations
were coded in a binary format (present vs. not present) for each category. Interrater reli-
ability was high (96.7% agreement, κ = .91), and all discrepancies were resolved through
discussion until 100% agreement was reached.
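As a sketch of how the agreement figures above can be computed for two raters’ binary codes, consider the following; the codes are hypothetical stand-ins, not the actual coding data.

from collections import Counter

def percent_agreement(r1, r2):
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters assigning categorical (here binary) codes."""
    n = len(r1)
    observed = percent_agreement(r1, r2)
    c1, c2 = Counter(r1), Counter(r2)
    # Expected chance agreement, summed over categories
    expected = sum((c1[cat] / n) * (c2[cat] / n) for cat in set(r1) | set(r2))
    return (observed - expected) / (1 - expected)

rater1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # hypothetical codes: 1 = violation present
rater2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(percent_agreement(rater1, rater2))       # 0.9
print(round(cohens_kappa(rater1, rater2), 2))  # 0.8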
VI Results
Violations were first averaged within each journal, and then the overall mean and stand-
ard deviation were computed (see Table 4). The table also reports the one-sample t-test
that examined whether the observed means were significantly different from zero. All
results were significant, with p-values below the Bonferroni-adjusted significance level (.05/7 ≈ .007). The effect sizes were also generally substantial.
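The analysis just described can be sketched as follows for a single violation category; the journal-level rates are simulated stand-ins for the actual data, and d for a one-sample test is the mean divided by the standard deviation.

import numpy as np
from scipy import stats

# Simulated per-journal rates for one violation category (30 journals)
rates = np.random.default_rng(1).beta(2, 6, size=30)

t, p = stats.ttest_1samp(rates, popmean=0.0)  # is the mean rate different from zero?
d = rates.mean() / rates.std(ddof=1)          # Cohen's d for a one-sample test
alpha_adjusted = 0.05 / 7                     # Bonferroni adjustment across the seven categories
print(f"M = {rates.mean():.3f}, t = {t:.2f}, p = {p:.5f}, d = {d:.2f}, "
      f"significant after adjustment: {p < alpha_adjusted}")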
Since the maximum possible mean was 1.0 for each violation following our binary
coding, the means in the table can also be interpreted as a probability. As an illustration,
the probability of an article having issues with reliability is 24.7% (see Table 4). In other
words, almost one in every four articles would have a reliability issue. On the other hand,
almost every other article makes inferences from descriptive statistics (44%). Similarly,
just over one in three articles does not report effect sizes (38.7%).
Since the normality and linearity assumptions were satisfied, the correlations between
journals’ statistical quality and their citation analysis scores were examined to shed further
light on these results. There was a positive correlation between the statistical quality (vio-
lations were coded here so that a higher value indicated higher quality) of the 30 journals
and their Scopus citation analysis metrics (SJR: r = .414, p = .023; SNIP: r = .339, p = .067;
CiteScore: r = .344, p = .062). In other words, SJR accounted for 17.1% of the variance
observed in journals’ statistical quality while SNIP and CiteScore accounted for 11.5%
and 11.2% of the observed variance, respectively. JCR impact factor was non-significant
for the 19 journals indexed by SSCI, r = –.129, p = .599, accounting for a negligible 1.7%
of the variance.
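The variance-explained figures follow from squaring the correlation coefficients, for example:

\[
r_{\mathrm{SJR}}^{2} = .414^{2} \approx .171, \qquad r_{\mathrm{SNIP}}^{2} = .339^{2} \approx .115 .
\]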
Table 3. Number of journals included and means and standard deviations of their 2015 citation analysis scores.

Scopus (n = 30): SNIP 1.24 (0.78); SJR 0.98 (0.77); CiteScore 1.17 (0.84)
SSCI (n = 19): JCR 1.42 (0.60)
Notes. Values are M (SD). JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; SSCI = Social Sciences Citation Index.
Although the correlation between JCR and journal statistical quality was non-signifi-
cant, an independent samples t-test showed that non-SSCI-indexed journals had signifi-
cantly more violations (M = 11.27, SD = 3.35, n = 11) than SSCI-indexed ones (M = 7.89,
SD = 3.59, n = 19), t(28) = 2.54, p = .017, d = 0.97. These results suggest that SSCI-
indexed journals demonstrate higher quantitative rigor. Table 5 lists L2 journals indexed
by both Scopus and SSCI that demonstrated the fewest violations.1
VII Discussion
This article has presented the results of an analysis of 150 articles derived from a list of 30
journals representative of the L2 field. A number of statistical violations were observed
with varying degrees of frequency. The results also showed that Scopus citation analysis
metrics represent moderate predictors of statistical quality (accounting for around 11–
17% of the variance), with no evidence to favor the newly introduced CiteScore over
SNIP or SJR. Although these metrics account for less than 20% of the variance in the
observed quality of L2 journals, this magnitude might be considered reasonable, consider-
ing that statistical quality is only one dimension factoring into the overall quality of a jour-
nal. Other dimensions include non-statistical components of quantitative articles as well
as other types of articles such as qualitative and conceptual ones. Indeed, ‘a single method,
regardless of the number of components included, could not account for important differ-
ences among journals and in reasons for publishing in them’ (Egbert, 2007, p. 157).
The results also show that JCR was not a significant predictor of journal statistical
quality. This finding does not necessarily imply that this metric is not useful. It may mean
that L2 journals indexed by SSCI do not demonstrate sufficient variation for JCR to cap-
ture it. To become indexed by SSCI is typically a long process in which journals are
expected to demonstrate a certain level of quality. Indeed, the results above showed that
journals indexed in SSCI exhibited fewer statistical violations than non-SSCI journals.
Overall, these results suggest two ways to evaluate L2 journal quality: 1) the journal’s SJR
value and 2) whether the journal is SSCI-indexed. The remainder of this article offers a
brief overview of the most common violations emerging from the present analysis.

Table 4. Prevalence of the seven violations emerging from the analysis.

Psychometrics
  1. Not reporting reliability: M = 0.247, SD = 0.23, t = 5.81, p < .0001, d = 1.06
  2. Not discussing validity: M = 0.087, SD = 0.13, t = 3.81, p = .00066, d = 0.70
Inferential statistics
  3. Making inferences from descriptive statistics: M = 0.440, SD = 0.29, t = 8.20, p < .0001, d = 1.50
  4. Incomplete reporting, including non-significant results: M = 0.253, SD = 0.23, t = 5.92, p < .0001, d = 1.08
  5. Not reporting effect sizes: M = 0.387, SD = 0.26, t = 8.25, p < .0001, d = 1.51
  6. Not adjusting for multiple comparisons: M = 0.140, SD = 0.20, t = 3.87, p = .00056, d = 0.71
Other
  7. Not reporting assumption checks: M = 0.253, SD = 0.25, t = 5.50, p < .0001, d = 1.00
1 Psychometrics
It is important for researchers to report details on the reliability and validity of their
instruments. In our analysis, a number of articles did not report the reliability of their
instruments, particularly dependent variables. In situations where manual coding is
involved, it is also important to examine and report interrater reliability as well. This
is especially important when a large amount of data is coded and subjectivity may be
involved.
When multiple scales are used (e.g. as part of a questionnaire), it is also important to
examine the factorial structure of the scales, whether using a procedure from classical
test theory or item response theory. It is common for researchers to adapt existing scales
for their own purposes without conducting a factor analytic procedure to examine
convergent and discriminant validity of these scales. In some cases, even the original
developers of these scales did not investigate these issues. Some may argue that such
scales have been established in previous research. However, it would seem arbitrary to
require reporting reliability every time a scale is administered but to assume that other
psychometric characteristics can simply be imported from prior research. Reliability also
gives only limited insight into the psychometric properties of a scale. Green et al.
(1977) offer examples of high reliability (e.g. over .80) that is merely an artifact of a long,
multidimensional scale, while Schmitt (1996) showed that low reliability (e.g. under .50)
is not necessarily problematic (see also Sijtsma, 2009, for a more detailed critique).
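The point made by Green et al. (1977) is easy to reproduce in a small simulation: with two uncorrelated factors of ten items each (entirely hypothetical data), coefficient α for the combined 20-item scale still comes out above .80.

import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(7)
n = 1000
factor1, factor2 = rng.normal(size=n), rng.normal(size=n)  # two unrelated constructs

# Ten items loading on each factor, plus item-specific noise
items1 = factor1[:, None] + rng.normal(size=(n, 10))
items2 = factor2[:, None] + rng.normal(size=(n, 10))
combined = np.hstack([items1, items2])                     # a 20-item, two-dimensional scale

print(round(cronbach_alpha(combined), 2))  # around .86 despite the scale not being unidimensional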
A discussion of validity is equally important. In our sample, a common situation
where a discussion of validity was lacking was the use of authentic variables, such as
school grades. When the researcher draws on such authentic variables, it is typically
out of the researcher’s hands to control for their reliability and validity. In such cases,
readers would at least need a detailed description of the variable, its characteristics, and
the circumstances surrounding its measurement in order to evaluate its adequacy for the
purpose of the article, such as whether grades are a fair reflection of proficiency in the
context in question. This information may also be helpful in resolving inconsistent results
when they emerge from different studies.

Table 5. Journals demonstrating highest quality in the present sample and their 2015 citation analysis scores.

Journal (SNIP / SJR / CiteScore / JCR)
Computer Assisted Language Learning: 1.54 / 1.26 / 1.64 / 1.72
English for Specific Purposes: 2.73 / 1.66 / 2.11 / 1.14
International Review of Applied Linguistics in Language Teaching: 0.97 / 0.91 / 0.95 / 0.80
Language Assessment Quarterly: 0.63 / 1.07 / 0.93 / 0.98
Language Learning: 2.54 / 2.47 / 2.58 / 1.87
Language Testing: 1.36 / 1.44 / 1.50 / 0.91
Modern Language Journal: 1.13 / 1.15 / 1.54 / 1.19
Studies in Second Language Acquisition: 1.41 / 2.49 / 1.99 / 2.23
TESOL Quarterly: 1.43 / 1.46 / 1.58 / 1.51
Notes. SNIP, SJR, and CiteScore are Scopus metrics; JCR (Journal Citation Reports Impact Factor) is a Web of Science (SSCI) metric. SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; SSCI = Social Sciences Citation Index.
When researchers develop their own instruments, extra work is required. Instrument
development should be treated as a crucial stage in a research project. Researchers devel-
oping a new instrument should perform adequate piloting to improve the psychometric
properties of the instrument before the actual study starts in order to convince the reader
of its outcomes. Poor instruments risk misleading results.
2 Inferential statistics
One of the most common issues arising in our analysis was the tendency to make infer-
ences from descriptive statistics. It is important to be aware of the distinction between
descriptive statistics and inferential statistics. Descriptive statistics refer to the character-
istics of the sample in hand. These characteristics could potentially be idiosyncratic to
this specific sample and not generalizable to the population it was sampled from.
Inferential statistics help decide whether these characteristics are generalizable to the
population, primarily through a trade-off between the magnitude of the descriptive sta-
tistic (e.g. mean difference between two groups) and the size of the sample.
Descriptive statistics alone may be useful in describing some general trends. However,
in most cases, without inferential statistics it may not be clear whether the pattern
observed is genuine or merely due to chance. This applies to all descriptive statistics,
such as means, standard deviations, percentages, frequencies, and counts. Researchers
reporting any of these statistics should consider an appropriate inferential test before
making generalizations to the population. In certain situations, it might at first seem hard
to think of an appropriate inferential test, but it is the researcher’s responsibility to dem-
onstrate to the reader that the results are generalizable to the population. Ideally, the
decision on which test to use should be made at the design stage before conducting the
study (e.g. preregistration).
In our sample, we found three common situations where inferences were frequently
made without inferential statistics. The first was when only one sample was involved. In
such situations, the researcher might consider a one-sample test to tell whether the statis-
tic is significantly different from zero (as was done in the present study). In some cases,
this might be a mundane procedure, but it is a first step toward calculating the size of the
effect (see below), which is typically a more interesting question. The second situation
arising from our analysis had to do with count data. A number of articles reported counts
of certain phenomena (e.g. number of certain linguistic features in a corpus), and then
made inferences based on those counts. In such cases, the researcher might consider the
chi-square test for independent groups and the McNemar test for paired groups. Rayson,
Berridge, and Francis (2004) have also suggested the log likelihood test for comparing
observed corpus frequencies. The third, more subtle, situation was when researchers
compared two test statistics, such as two correlation coefficients. In these situations, the
two coefficients might be different but the question is whether this difference is itself
large enough to be statistically significant. In fact, even if one coefficient were signifi-
cant and the other were not, this would not be sufficient to conclude that the difference
between them would also be significant (see Gelman & Stern, 2006). In this case, Fisher’s
r-to-z transformation could be used.
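The following sketch illustrates two of these options with hypothetical numbers: a chi-square test of independence for count data and a Fisher r-to-z comparison of two independent correlations. It is an illustration of the general procedures named above, not of any specific study in our sample.

import numpy as np
from scipy import stats

# Count data: chi-square test of independence on a 2 x 2 table of hypothetical frequencies
counts = np.array([[30, 10],
                   [18, 22]])
chi2, p_chi, dof, expected = stats.chi2_contingency(counts)
print(f"chi2({dof}) = {chi2:.2f}, p = {p_chi:.3f}")

def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)    # Fisher transformation
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of the difference
    z = (z1 - z2) / se
    return z, 2 * stats.norm.sf(abs(z))        # two-tailed p-value

z, p_z = compare_correlations(0.45, 60, 0.20, 55)  # hypothetical r and n values
print(f"z = {z:.2f}, p = {p_z:.3f}")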
Another issue arising from our results is incomplete reporting of results, including
non-significant ones. A number of articles presented their results in detail if they were
significant, but the non-significant results were abridged and presented only in passing.
Regardless of the outcome of the significance test, the results should be reported in full,
including descriptive statistics, test statistics, degrees of freedom (where relevant), p-val-
ues, and effect sizes. Failing to report non-significant results can lead to publication bias,
in which only significant results become available to the research community (Rothstein,
Sutton, & Borenstein, 2005). Failing to report non-significant results in full may also
preclude the report from inclusion in future meta-analyses.
In our analysis, we did not consider it a violation if confidence intervals were not
reported. Although there have been recent calls in the L2 field to report confidence inter-
vals to address some limitations of significance testing (e.g. Lindstromberg, 2016; Nassaji,
2012; Norris, 2015), this issue is less straightforward than it might seem at first. This is due
to the controversy surrounding the interpretation of confidence intervals. Since they were
first introduced (Neyman, 1937), confidence intervals have never been intended to repre-
sent the uncertainty around the result, its precision, or its likely values. Morey,
Hoekstra, Rouder, Lee, and Wagenmakers (2016) refer to such interpretations as the fallacy
of placing confidence in confidence intervals, which is prevalent among students and
researchers alike (Hoekstra, Morey, Rouder, & Wagenmakers, 2014). As a matter of fact,
confidence intervals of a parameter refer to the interval that, in repeated sampling, has on
average a fixed (e.g. 95%) probability of containing that parameter. Confidence intervals
therefore concern the probability in the long run, and may not be related to the results from
the original study. Some statisticians have even gone as far as to describe confidence inter-
vals as ‘scientifically useless’ (Bolstad, 2007, p. 228). Whether the reader would agree with
this evaluation or consider it rather extreme, our aim is to point out that confidence inter-
vals are far from the unanimously accepted panacea for significance testing ills.
In contrast to confidence intervals, there is far more agreement among methodologists
on the need for reporting effect sizes to complement significance testing. The discussion
so far has mentioned significance and significance testing several times, probably giving
the impression of the meaningfulness of this procedure. Actually, a significant result is a
relatively trivial outcome. Because the null hypothesis is always false (Cohen, 1994),
just increase your sample size if you want a significant result! A p-value is the probability
of obtaining data at least as extreme as those observed if the null hypothesis is true. When
the result is significant at the .05 level, we can conclude that the probability of obtaining
such an outcome by chance alone is less than 5% if the null hypothesis is true. It does not
mean that the effect is big or strong. At the same time,
a non-significant result does not represent evidence against the hypothesis, as it is pos-
sible that the study was underpowered. To obtain evidence against a hypothesis, the
researcher needs to calculate the power of the test and then the effect size to demonstrate
that there is no ‘non-negligible’ effect (see Cohen et al., 2003; Lakens, 2013), or alterna-
tively use Bayes factors (see Dienes, 2014).
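One way to act on this advice is an a priori (or sensitivity) power calculation for a smallest effect size of interest. The sketch below assumes an independent-samples t-test and hypothetical values for the effect size and group size; it relies on the power module in statsmodels and is only an illustration.

from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

# Power to detect a smallest effect of interest of d = 0.40 with 25 learners per group
achieved = power_analysis.power(effect_size=0.40, nobs1=25, alpha=0.05, ratio=1.0)
print(f"power = {achieved:.2f}")  # well below the conventional .80 benchmark

# Sample size per group needed to reach 80% power for the same effect
needed = power_analysis.solve_power(effect_size=0.40, alpha=0.05, power=0.80)
print(f"n per group = {needed:.0f}")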
The final point discussed in this section is not adjusting for multiple comparisons. As
mentioned above, a p-value gives the probability of obtaining the data if the null is true.
It therefore does not refer to the probability of the hypothesis itself. With multiple com-
parisons, the likelihood of obtaining a significant result by mere chance no longer
remains at 5%, thus raising the risk of Type I error. One way to address this problem is
to implement an appropriate correction procedure (Larson-Hall, 2016; Ludbrook, 1998).
Another approach is to determine the specific tests to be conducted beforehand. Any
other analyses conducted should then be labeled explicitly as exploratory, since their
results could potentially reflect Type I error. Perhaps the worst a researcher could do in
this regard is to conduct various tests and report only those that turn out significant.
3 Other issues
In our analysis, a number of issues emerged related to not reporting assumption checks
before conducting particular procedures. This was placed in a separate category because
it is used here in a broad sense to refer to both descriptive and inferential statistics, as
well as psychometrics. For example, some articles used nonparametric tests due to non-
normality of the data but also reported the mean and standard deviation to describe their
data. Using the mean and standard deviation assumes that the data are normal. Many
articles also used inferential tests that require certain assumptions, such as normality and
linearity, but without assuring the reader that these assumptions were satisfied. Other
articles that performed factor analysis did not report factor loadings fully or address the
implications of cross-loadings. Many of these concerns can be addressed by simply making the dataset publicly available.
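A minimal sketch of such a check with hypothetical scores: test and inspect normality before deciding how to describe the data, and fall back on the median and interquartile range when normality is doubtful.

import numpy as np
from scipy import stats

scores = np.array([2, 3, 3, 4, 4, 4, 5, 5, 6, 9, 14, 21])  # hypothetical, right-skewed scores

w, p = stats.shapiro(scores)  # Shapiro-Wilk test of normality
print(f"Shapiro-Wilk: W = {w:.2f}, p = {p:.3f}")

if p < .05:
    # Normality doubtful: describe with median and interquartile range, consider nonparametric tests
    q1, q3 = np.percentile(scores, [25, 75])
    print(f"median = {np.median(scores):.1f}, IQR = {q3 - q1:.1f}")
else:
    print(f"M = {scores.mean():.2f}, SD = {scores.std(ddof=1):.2f}")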
A particularly overlooked assumption is independence (i.e. uncorrelated errors). Many statistical procedures
assume that errors are uncorrelated. For example, when learners come from distinct
classes, learners from the same class will be more similar to each other than those from
different classes. When this happens, the observations are no longer independent. As a consequence,
Type I error rate increases, such as when learners from one class in the group have higher
scores because of some unique feature of that class. In this case, the overall group mean
will be inflated because of one class only. The effect of violating this independence
assumption might be mild when there are only a few classes. But with more classes (e.g.
over 20), the effect could be more serious. One approach to deal with this situation is to
use classes as the unit of analysis by averaging the values for learners within each class.
Another approach is to use multilevel and mixed-effects modeling to model both higher
and lower units simultaneously (Hox, 2010).
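A minimal sketch of the second approach, assuming a long-format dataset with hypothetical columns score, group, and class_; the random intercept for class absorbs the within-class similarity described above.

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per learner, with the class each learner belongs to
data = pd.DataFrame({
    "score": [62, 70, 68, 75, 80, 77, 66, 72, 79, 83, 69, 74],
    "group": ["control"] * 6 + ["treatment"] * 6,
    "class_": ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"],
})

# A random intercept for class models the non-independence of learners within the same class
# (a toy dataset; real analyses need many more learners per class)
model = smf.mixedlm("score ~ group", data, groups=data["class_"]).fit()
print(model.summary())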
VIII Conclusions
This article has presented a review of 30 journals representative of the L2 field. The review
focused on statistical issues specifically, rather than methodological issues. It was not
intended to downplay the importance of methodological issues, such as an adequate sample
size based on a priori power calculation or including a control group in (quasi)experimental
designs. Instead, we focused on statistical issues only because they seemed relevant to a
broader section of the field, including areas fraught with practical constraints.
The results showed that Scopus’s citation analysis metrics function as moderate pre-
dictors of L2 journal quality, accounting for around 11–17% of the observed variance in
journals’ statistical quality, thus providing no evidence in favor of the newly introduced
CiteScore over the other metrics, at least in our field. Another indicator of journal quality
is whether the journal is SSCI-indexed. SSCI’s JCR was not a significant predictor of
journal quality, probably due to the small variation among SSCI-indexed journals, most
of which show high quality in the L2 field. The analysis also revealed a number of
prevalent statistical violations that were surveyed in this article. Future research should
investigate other aspects of journal quality (i.e. other than statistical) to examine their
relationship with journal indexing and citation analysis metrics.
The present study is not without limitations. Our sample of 30 journals was rather
small. However, we were limited by the available number of L2 journals that are indexed
by Scopus and SSCI. For this reason, we did not have the luxury of conducting a power
analysis and then obtaining a sufficiently large sample. Nevertheless, our study is still one
of the largest quantitative surveys of L2 journals in the field to date. A further limitation
is whether selecting five articles from a journal would be truly representative of that
journal. In our case, in addition to aiming to investigate the most recent quantitative
trends in journals, we were also bound by practical constraints. A total of 150 journal
articles to read and analyse is no easy feat. Rather than recommending that future research-
ers use a larger sample than ours, an alternative approach is to conduct compara-
ble studies on more recent literature and then combine the results meta-analytically. This
would help build a cumulative science of journal quality in the field.
Funding
This research received no specific grant from any funding agency in the public, commercial, or
not-for-profit sectors.
Note
1. We do not claim that other journals not listed in Table 5 necessarily have lower quality
because our sample was not exhaustive of journals in the field and because it included only
five recent articles from each journal. In fact, even for journals listed in Table 5, we do not
recommend that researchers interested in improving their statistical literacy browse older issues
of these journals, since quality (and editorial policies) change over time.
ORCID iDs
Ali H. Al-Hoorie https://orcid.org/0000-0003-3810-5978
Joseph P. Vitta https://orcid.org/0000-0002-5711-969X
References
Benson, P., Chik, A., Gao, X., Huang, J., & Wang, W. (2009). Qualitative research in language
teaching and learning journals, 1997–2006. The Modern Language Journal, 93, 79–90.
Bolstad, W.M. (2007). Introduction to Bayesian Statistics. 2nd edition. Hoboken, NJ: Wiley.
Brown, J.D. (1990). The use of multiple t tests in language research. TESOL Quarterly, 24, 770–773.
Brown, J.D. (2004). Resources on quantitative/statistical research for applied linguists. Second
Language Research, 20, 372–393.
Brumback, R.A. (2009). Impact factor wars: Episode V: The empire strikes back. Journal of Child
Neurology, 24, 260–262.
Chadegani, A.A., Salehi, H., Yunus, M.M., et al. (2013). A comparison between two main aca-
demic literature collections: Web of Science and Scopus databases. Asian Social Science, 9,
18–26.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences. 3rd edition. Mahwah, NJ: Lawrence Erlbaum.
Colledge, L., de Moya-Anegón, F., Guerrero-Bote, V., et al. (2010). SJR and SNIP: Two new
journal metrics in Elsevier’s Scopus. Serials, 23, 215–221.
da Silva, J.A.T., & Memon, A.R. (2017). CiteScore: A cite for sore eyes, or a valuable, transparent
metric? Scientometrics, 111, 553–556.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in
Psychology, 5. Available online at http://doi.org/10.3389/fpsyg.2014.00781 (accessed March 2018).
Egbert, J.O.Y. (2007). Quality analysis of journals in TESOL and applied linguistics. TESOL
Quarterly, 41, 157–171.
Garfield, E. (2006). The history and meaning of the journal impact factor. JAMA, 295, 90–93.
Gelman, A., & Stern, H.S. (2006). The difference between ‘significant’ and ‘not significant’ is not
itself statistically significant. The American Statistician, 60, 328–331.
Green, S.B., Lissitz, R.W., & Mulaik, S.A. (1977). Limitations of coefficient alpha as an index of
test unidimensionality. Educational and Psychological Measurement, 37, 827–838.
Guerrero-Bote, V.P., & Moya-Anegón, F. (2012). A further step forward in measuring journals’
scientific prestige: The SJR2 indicator. Journal of Informetrics, 6, 674–688.
Guz, A.N., & Rushchitsky, J.J. (2009). Scopus: A system for the evaluation of scientific journals.
International Applied Mechanics, 45, 351–362.
Hoekstra, R., Morey, R.D., Rouder, J.N., & Wagenmakers, E.-J. (2014). Robust misinterpretation
of confidence intervals. Psychonomic Bulletin & Review, 21, 1157–1164.
Hong-Nam, K., & Leavell, A.G. (2006). Language learning strategy use of ESL students in an
intensive English learning context. System, 34, 399–415.
Hox, J.J. (2010). Multilevel analysis: Techniques and applications. 2nd edition. New York:
Routledge.
Jung, U.O.H. (2004). Paris in London revisited or the foreign language teacher’s top-most jour-
nals. System, 32, 357–361.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practi-
cal primer for t-tests and ANOVAs. Frontiers in Psychology, 4. Available online at http://doi.
org/10.3389/fpsyg.2013.00863 (accessed March 2018).
Larson-Hall, J. (2016). A guide to doing statistics in second language research using SPSS and R.
2nd edition. New York: Routledge.
Larson-Hall, J. (2017). Moving beyond the bar plot and the line graph to create informative and
attractive graphics. The Modern Language Journal, 101, 244–270.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research findings:
What gets reported and recommendations for the field. Language Learning, 65, 127–159.
Leydesdorff, L., & Opthof, T. (2010). Scopus’s source normalized impact per paper (SNIP) versus
a journal impact factor based on fractional counting of citations. Journal of the American
Society for Information Science and Technology, 61, 2365–2369.
Lindstromberg, S. (2016). Inferential statistics in Language Teaching Research: A review and
ways forward. Language Teaching Research, 20, 741–768.
Loewen, S., & Gass, S. (2009). The use of statistics in L2 acquisition research. Language Teaching,
42, 181–196.
Loewen, S., Crowther, D., Isbell, D., Lim, J., Maloney, J., & Tigchelaar, M. (2017). The statisti-
cal literacy of applied linguistics researchers. Unpublished paper presented at the American
Association for Applied Linguistics (AAAL), Portland, Oregon, USA.
Loewen, S., Lavolette, E., Spino, L.A., et al. (2014). Statistical literacy among applied linguists
and second language acquisition researchers. TESOL Quarterly, 48, 360–388.
Ludbrook, J. (1998). Multiple comparison procedures updated. Clinical and Experimental
Pharmacology and Physiology, 25, 1032–1037.
Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., & Wagenmakers, E.-J. (2016). The fallacy
of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23, 103–123.
Nassaji, H. (2012). Statistical significance tests and result generalizability: Issues, misconceptions,
and a case for replication. In G.K. Porte (Ed.), Replication research in applied linguistics (pp.
92–115). Cambridge: Cambridge University Press.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of prob-
ability. Philosophical Transactions of the Royal Society of London, Series A: Mathematical
and Physical Sciences, 236, 333–380.
Norris, J.M. (2015). Statistical significance testing in second language research: Basic problems
and suggestions for reform. Language Learning, 65, 97–126.
Norris, J.M., Plonsky, L., Ross, S.J., & Schoonen, R. (2015). Guidelines for reporting quantitative
methods and results in primary research. Language Learning, 65, 470–476.
Oxford, R.L. (1990). Language learning strategies: What every teacher should know. New York:
Newbury.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting prac-
tices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodological
synthesis and call for reform. The Modern Language Journal, 98, 450–470.
Plonsky, L. (Ed.) (2015). Advancing quantitative methods in second language research. New
York: Routledge.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The
case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Oswald, F.L. (2014). How big is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Radwan, A.A. (2011). Effects of L2 proficiency and gender on choice of language learning strate-
gies by university students majoring in English. The Asian EFL Journal, 13, 115–163.
Rayson, P., Berridge, D., & Francis, B. (2004). Extending the Cochran rule for the comparison
of word frequencies between corpora. Unpublished paper presented at the 7th International
Conference on Statistical analysis of textual data (JADT 2004), Louvain-la-Neuve, Belgium.
Rothstein, H.R., Sutton, A.J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-analysis:
Prevention, assessment and adjustments. Chichester: Wiley.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha.
Psychometrika, 74, 107–120.
Stevens, J. (2009). Applied multivariate statistics for the social sciences. 5th edition. New York:
Routledge.
Tabachnick, B.G., & Fidell, L.S. (2013). Using multivariate statistics. 6th edition. Boston, MA:
Pearson.
Tseng, W.-T., Dörnyei, Z., & Schmitt, N. (2006). A new approach to assessing strategic learning:
The case of self-regulation in vocabulary acquisition. Applied Linguistics, 27, 78–102.
Vitta, J.P., & Al-Hoorie, A.H. (2017). Scopus- and SSCI-indexed L2 journals: A list for the Asia
TEFL community. The Journal of Asia TEFL, 14, 784–792.
Weiner, G. (2001). The academic journal: Has it a future? Education Policy Analysis Archives, 9.
Available online at http://doi.org/10.14507/epaa.v9n9.2001 (accessed March 2018).
Author biographies
Ali H. Al-Hoorie is assistant professor at the English Language Institute, Jubail Industrial College,
Saudi Arabia. He completed his PhD degree at the University of Nottingham, UK, under the super-
vision of Professors Zoltán Dörnyei and Norbert Schmitt. He also holds an MA in Social Science
Data Analysis from Essex University, UK. His research interests include motivation theory,
research methodology, and complexity.
Joseph P. Vitta is active in TESOL/Applied Linguistics research with interests and publications in
lexis, curriculum design, research methods, and computer-assisted language learning. As an ELT
professional, he has over 12 years’ experience as both a program manager and language teacher.
Appendix 1
Keywords
CALL, computer assisted language learning, EAP, English for academic purposes, EFL, English
as a foreign language, ELL, English language learner, ELT, English language teaching, ESP,
English for specific purposes, FLA, Foreign language acquisition, foreign language, language
acquisition, language assessment, language testing, language classroom, language curriculum, lan-
guage education, language educator, language learning, language learner, language learners, lan-
guage proficiency, language teaching, language teacher, language teachers, second language,
SLA, second language acquisition, TEFL, teaching English as a foreign language, TESL, teaching
English as a second language, TESOL, teaching English to speakers of other languages, teaching
English.
Appendix 2
Journals
1. Applied Linguistics
2. Asian EFL Journal
3. Asian ESP Journal
4. CALL-EJ
5. Computer Assisted Language Learning
6. Electronic Journal of Foreign Language Teaching
7. ELT Journal
8. English for Specific Purposes
9. Foreign Language Annals
10. Indonesian Journal of Applied Linguistics
11. Innovation in Language Learning and Teaching
12. International Review of Applied Linguistics in Language Teaching
13. Iranian Journal of Language Teaching Research
14. Journal of Asia TEFL
15. Journal of English for Academic Purposes
16. Journal of Second Language Writing
17. Language Assessment Quarterly
18. Language Learning
19. Language Learning & Technology
20. Language Learning Journal
21. Language Teaching Research
22. Language Testing
23. Modern Language Journal
24. ReCALL
25. Second Language Research
26. Studies in Second Language Acquisition
27. System
28. Teaching English with Technology
29. TESOL Quarterly
30. JALT CALL Journal
... No clear evidence was obtained favoring the newly introduced CiteScore over SNIP or SJR (Al-Hoorie and Vitta, 2019). [27] In the last decade, several journal editors have decided to publish alternative bibliometric indices parallel to the impact factor (IF): Scimago Journal Rank (SJR), Source Normalized Impact per Paper (SNIP), Eigenfactor Score (ES) and CiteScore; however, little is known about the correlations between them. more strongly affects CiteScore Y than the traditional JIF Y, with the citation window extended to cover Y − 3 to Y (Fang, 2021). ...
Article
Full-text available
Over the last few years, CiteScore has emerged as a popular metric to measure the performance of Journals. In this paper, we analyze CiteScores of the top 400 Scopus-indexed journals of 2021 for years from 2011 to 2021. Some interesting observations emerged from the analysis. The average CiteScore of the top 400 journals doubled from 16.48 in 2011 to 31.83 in 2021. At the same time, the standard deviation has almost trebled from 13.53 in 2011 to 38.18 in 2021. The CiteScores also show sizable increases for skewness and kurtosis, implying major variations in the CiteScores of the journals for a year. Importantly, the previous year's CiteScores strongly predict the next year's scores. This has been observed consistently for the last ten years. The average Pearson correlation coefficient between the preceding and succeeding years' CiteScores for the ten years is 0.98. We also show that it is easily possible for even people with just basic knowledge of computers to forecast the CiteScore. Researchers can predict CiteScores based on the past year's CiteScores and decide better about publishing their current research in a journal with an idea about its likely CiteScore. Such a forecast can be useful to publishers, editorial staff, indexing services, university authorities, and funding agencies.
Article
Full-text available
Statistical methods generally have assumptions (e.g., normality in linear regression models). Violations of these assumptions can cause various issues, like statistical errors and biased estimates, whose impact can range from inconsequential to critical. Accordingly, it is important to check these assumptions, but this is often done in a flawed way. Here, I first present a prevalent but problematic approach to diagnostics—testing assumptions using null hypothesis significance tests (e.g., the Shapiro–Wilk test of normality). Then, I consolidate and illustrate the issues with this approach, primarily using simulations. These issues include statistical errors (i.e., false positives, especially with large samples, and false negatives, especially with small samples), false binarity, limited descriptiveness, misinterpretation (e.g., of p-value as an effect size), and potential testing failure due to unmet test assumptions. Finally, I synthesize the implications of these issues for statistical diagnostics, and provide practical recommendations for improving such diagnostics. Key recommendations include maintaining awareness of the issues with assumption tests (while recognizing they can be useful), using appropriate combinations of diagnostic methods (including visualization and effect sizes) while recognizing their limitations, and distinguishing between testing and checking assumptions. Additional recommendations include judging assumption violations as a complex spectrum (rather than a simplistic binary), using programmatic tools that increase replicability and decrease researcher degrees of freedom, and sharing the material and rationale involved in the diagnostics.
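A quick simulation makes the sample-size problem concrete: with large samples the Shapiro-Wilk test flags trivial departures from normality, while with small samples it misses substantial ones. The settings below (distributions, sample sizes, number of replications) are arbitrary choices for illustration only.

# How sample size drives Shapiro-Wilk rejection rates (illustrative settings).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
reps = 500

def rejection_rate(sampler, n):
    """Proportion of simulated samples for which Shapiro-Wilk gives p < .05."""
    return np.mean([stats.shapiro(sampler(n))[1] < 0.05 for _ in range(reps)])

nearly_normal  = lambda n: rng.standard_t(df=20, size=n)   # only trivially heavier-tailed than normal
clearly_skewed = lambda n: rng.exponential(size=n)         # substantially non-normal

for n in (15, 50, 500, 2000):
    print(f"n = {n:>4}: near-normal rejected {rejection_rate(nearly_normal, n):.2f}, "
          f"skewed rejected {rejection_rate(clearly_skewed, n):.2f}")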
Article
Full-text available
Concerns have recently been raised about the validity of scales used in the L2 motivational self system tradition, particularly in relation to whether some of its scales show sufficient discriminant validity. These concerns highlight the need to systematically examine the validity of scales used in this tradition. In this study, we therefore compiled a list of 18 scales in widespread use and administered them to Korean learners of English (N = 384). Testing the factorial structure of these scales using multiple exploratory and confirmatory factor-analytic criteria revealed severe discriminant validity issues. For example, the ideal L2 self was not discriminant from linguistic self-confidence, suggesting that participant responses to such ideal L2 self items are not driven by actual–ideal discrepancies, as previously presumed, but more likely by self-efficacy beliefs. We discuss these results in the context of the need to encourage systematic psychometric validation research in the language motivation field.
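As a rough first-pass screen for the kind of discriminant validity problem described above, one can inspect the correlations among scale scores and flag pairs so highly correlated that they may not be distinct constructs. The sketch below uses invented scale names and simulated data; it is not a substitute for the factor-analytic criteria used in the study.

# Rough discriminant-validity screen on observed scale scores (simulated data).
# A full check would use CFA-based criteria (e.g., latent correlations or HTMT).
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 300
base = rng.normal(size=n)  # shared source of variance to create overlap

scores = pd.DataFrame({
    "ideal_L2_self":   base + rng.normal(scale=0.4, size=n),
    "self_confidence": base + rng.normal(scale=0.4, size=n),  # deliberately overlapping
    "L2_anxiety":      rng.normal(size=n),
})

corr = scores.corr()
threshold = 0.80  # common rule-of-thumb cut-off for "possibly the same construct"

for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        flag = "  <-- possible discriminant validity problem" if abs(corr.loc[a, b]) > threshold else ""
        print(f"r({a}, {b}) = {corr.loc[a, b]: .2f}{flag}")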
Article
Full-text available
Researchers from various academic fields have paid close attention to mindsets for many years, and second language acquisition is no exception. Previous studies in SLA introduced language-specific mindsets by developing the Language Mindsets Inventory scale to examine language learners' mindsets. Although a Japanese version of this scale has already been developed, there may be room for improvement in terms of translation and validation. The current study aimed to develop a more refined Japanese version of the scale through a careful translation process and a thorough examination of its reliability and validity. Participants in the current study were 179 EFL university students in Japan. Results of the reliability analysis showed that the new Japanese-translated scale with 12 items was highly reliable. Confirmatory factor analysis also showed that the four-factor model had a good fit, and further evidence was added by examining convergent, discriminant, and concurrent validity. The outcomes of the study imply that mindsets may not transcend subject areas, which suggests the need to develop skill-specific mindset scales.
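For readers who want to reproduce a reliability figure like the one reported above, Cronbach's alpha can be computed directly from an item-by-respondent matrix. The function below is generic; the simulated 12-item responses are placeholders, not the study's data.

# Cronbach's alpha from an items matrix (rows = respondents, columns = items).
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(3)
n_respondents, n_items = 179, 12
trait = rng.normal(size=(n_respondents, 1))                           # common factor
items = trait + rng.normal(scale=0.8, size=(n_respondents, n_items))  # correlated items

print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")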
Article
Full-text available
Self-efficacy has emerged as a popular construct in second language research, especially in the frontline and practitioner-researcher spaces. A troubling trend in the relevant literature is that self-efficacy is often measured in a general or global manner. Such research ignores the fact that self-efficacy is a smaller context-driven construct that should be measured within a specific task or activity where time, place, and purpose domains are considered in the creation of the measurement. Task-based language teaching researchers have also largely neglected the affective factors that may influence task participation, including self-efficacy, despite its potential application to understanding task performance. In this report, we present an instrument specifically developed to measure English as a foreign language students' self-efficacy beliefs when performing a dialogic, synchronous, quasi-formal group discussion task. The instrument's underlying psychometric properties were assessed (N = 130; multisite sample from Japanese universities) and evidence suggested that it could measure a unidimensional construct with high reliability. The aggregate scale constructed from the instrument's items also displayed a central tendency and normal unimodal distribution. This was a positive finding and suggested that the instrument could be useful in producing a self-efficacy measurement for use in the testing designs preferred by second language researchers. The potential applications of this instrument are discussed while highlighting how this report acts as an illustration for investigators to use when researching self-efficacy.
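A lightweight way to check the two properties highlighted above, a single dominant dimension and a roughly normal aggregate score, is to inspect the eigenvalues of the item correlation matrix and the skewness and kurtosis of the scale mean. The item count and responses below are simulated placeholders, not the instrument itself.

# Quick unidimensionality and distribution checks on simulated Likert-type items.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, k = 130, 8
trait = rng.normal(size=(n, 1))
items = np.clip(np.round(3.5 + trait + rng.normal(scale=0.9, size=(n, k))), 1, 6)  # 6-point items

eigvals = np.linalg.eigvalsh(np.corrcoef(items, rowvar=False))[::-1]  # descending order
print("Proportion of variance on the first component:", round(eigvals[0] / eigvals.sum(), 2))

scale_mean = items.mean(axis=1)
print("Skewness:", round(stats.skew(scale_mean), 2),
      " Excess kurtosis:", round(stats.kurtosis(scale_mean), 2))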
Article
Full-text available
The process of assessing the scholarly output of academics is becoming increasingly challenging within the contemporary landscape of academia. Evaluation committees often extensively search multiple repositories to compile their evaluation summary report on an individual. Nevertheless, deriving performance metrics about a scholar's dynamics and progression poses a considerable challenge. This study introduces a novel computational approach utilizing unsupervised machine learning, which has the potential to serve as a valuable tool for committees tasked with evaluating the scholarly achievements of individuals across different universities of Pakistan, namely, Air, Quaid e Azam, International Islamic University, FAST, UET Taxila, COMSAT and NUST university. The proposed methodology generates a comprehensive set of key performance indicators (KPIs) for each researcher and monitors their progression over time. The considered variables are employed within a clustering framework, which uses clustering validity metrics to automatically ascertain the optimal number of clusters. This is done before the classification of scholars into distinct groups. The assignment of performance indicators to the clusters can ultimately function as the primary profile characteristics of the individuals within those clusters. This enables the deduction of a profile for each scholar. The present empirical investigation centres on analyzing rising or emerging stars who exhibit the greatest advancements over time concerning all Key KPIs. Additionally, this study can be utilized to assess the performance of scholarly groups.
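The clustering-with-validity-metrics idea described above can be sketched in a few lines: standardize the KPI matrix, fit k-means for a range of k, and keep the solution with the best silhouette score. The KPI values here are random placeholders, and k-means with silhouette is only one reasonable choice, not necessarily the study's exact method.

# Choosing the number of scholar clusters with a cluster-validity metric.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# rows = scholars, columns = KPIs (e.g., papers/year, citations/year, h-index growth)
kpis = np.vstack([rng.normal(loc, 1.0, size=(40, 3)) for loc in (0, 3, 6)])
X = StandardScaler().fit_transform(kpis)

best_k, best_score = None, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    print(f"k = {k}: silhouette = {score:.2f}")
    if score > best_score:
        best_k, best_score = k, score

print("Chosen number of clusters:", best_k)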
Article
Full-text available
The present study assesses linguistic and geographic diversity in selected outlets of SLA and multilingualism research. Specifically, we examine over 2,000 articles published in specialized top-tier journals, recording the languages under study and their acquisition order, author affiliations, the country in which the research was conducted, and citations. In the sample, there were 183 unique languages and 174 unique pairings, corresponding to 3 per cent of the world's 7,000 languages and less than 0.001 per cent of 24.5 million possible language combinations. English was overwhelmingly the most common language, followed by Spanish and Mandarin Chinese. North America and Western Europe were both the main producers of knowledge and the main sites for research on multilingualism in the sample. Crucially, the regions with the highest levels of linguistic diversity and societal multilingualism (typically the Global South) were only marginally represented. The findings also show that studies on English and northern Anglophone settings were likely to elicit more citations than studies on other languages and settings, and that less studied languages were included more frequently in article titles.
Preprint
Full-text available
Complexity theory/Complex dynamic systems theory has challenged conventional approaches to applied linguistics research by encouraging researchers to adopt a pragmatic transdisciplinary approach that is less paradigmatic and more problem-oriented in nature. Its proponents have argued that the starting point in research design should not be the quantitative–qualitative distinction, and even mixed methods, but the distinction between individual- versus group-based designs (i.e., idiographic versus nomothetic). Taking insights from the application of complexity research in other human and social sciences, we propose an integrative transdisciplinary framework that unites these different perspectives (quantitative–qualitative, individual–group based) but makes the starting point exploratory–falsificatory aims. We discuss the implications of this transdisciplinary approach to applied linguistics research and illustrate how such an integrated approach might be implemented in the field.
Preprint
Full-text available
At the turn of the new millennium, in an article published in Language Teaching Research in 2000, Dörnyei and Kormos proposed that ‘active learner engagement is a key concern’ for all instructed language learning. Since then, language engagement research has increased exponentially. In this article, we present a systematic review of 20 years of language engagement research. To ensure robust coverage, we searched 21 major journals on second language acquisition (SLA) and applied linguistics and identified 112 reports satisfying our inclusion criteria. The results of our analysis of these reports highlighted the adoption of heterogeneous methods and conceptual frameworks in the language engagement literature, as well as indicating a need to refine the definitions and operationalizations of engagement in both quantitative and qualitative research. Based on these findings, we attempted to clarify some lingering ambiguity around fundamental definitions, and to more clearly delineate the scope and target of language engagement research. We also discuss future avenues to further advance understanding of the nature, mechanisms, and outcomes resulting from engagement in language learning.
Preprint
Full-text available
In this chapter, we provide an overview of various topics related to open science, drawing often (and necessarily) on work outside of applied linguistics. Here, we define open science more broadly. Rather than limiting open science to the question of whether a study and its data are open access or behind a paywall, this chapter defines open science more generally as transparency in all aspects of the research process (see also Gass et al., 2021). From this perspective, the most relevant discipline dealing with these issues is metascience (or meta-research).
Article
Full-text available
In many Asia TEFL contexts, emphasis in academic promotions and tenure is placed on journals listed in Scopus and the Social Sciences Citation Index (SSCI). However, these indexing services do not offer a subcategory specific to second/foreign language (L2) research and practice, apart from rather generic categories such as linguistics and education. To address this gap, this brief report attempts to construct a comprehensive list of L2 Scopus- and SSCI-indexed journals. In this article, we present this list and the two-stage process we followed to obtain it.
Article
Full-text available
This study investigates the use of language learning strategies by 128 students majoring in English at Sultan Qaboos University (SQU) in Oman. Using Oxford's (1990) Strategy Inventory for Language Learning (SILL), the study seeks to extend our current knowledge by examining the relationship between the use of language learning strategies (LLS) and both gender and English proficiency, with proficiency measured using three criteria: students' grade point average (GPA) in English courses, study duration in the English Department, and students' perceived self-rating. It is also a response to a call by Oxford to examine the relationship between LLSs and various factors in a variety of settings and cultural backgrounds (see Oxford, 1993). Results of a one-way analysis of variance (ANOVA) showed that the students used metacognitive strategies significantly more than any other category of strategies, with memory strategies ranking last on students' preference scale. Contrary to the findings of a number of studies (see e.g., Hong-Nam & Leavell, 2006), male students used more social strategies than female students, the only difference between the two groups in terms of their strategic preferences. Moreover, ANOVA results revealed that more proficient students used more cognitive, metacognitive and affective strategies than less proficient students. As for study duration, the results showed a curvilinear relationship between strategy use and study duration, where freshmen used the most strategies, followed by juniors, then seniors and sophomores, respectively. Analysis of the relationship between strategy use and self-rating revealed a sharp contrast between learners who are self-efficacious and those who are not, favoring the first group in basically every strategy category. To find out which type of strategy predicted learners' L2 proficiency, a backward stepwise logistic regression analysis was performed on students' data, revealing that use of cognitive strategies was the only predictor that distinguished between students with high GPAs and those with low GPAs. The present study suggests that the EFL cultural setting may be a factor that determines the type of strategies preferred by learners. This might be especially true since some of the results obtained in this study vary from the results of studies conducted in other cultural contexts. Results of this study may be used to inform pedagogical choices at university and even pre-university levels.
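For readers unfamiliar with the analyses mentioned above, the sketch below shows the general shape of a one-way ANOVA across strategy categories and a logistic regression predicting a high versus low GPA grouping from strategy scores. The data are simulated and the model is deliberately simplified (no stepwise selection), so it illustrates the analysis types rather than reproducing the study.

# Illustration only: one-way ANOVA across strategy categories and a logistic
# regression predicting high vs. low GPA. Data are simulated; the original
# study used a backward stepwise procedure, which is not reproduced here.
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 128

memory        = rng.normal(3.0, 0.6, n)   # mean use of three (of six) SILL categories,
cognitive     = rng.normal(3.4, 0.6, n)   # simulated on a 1-5 scale
metacognitive = rng.normal(3.8, 0.6, n)

f, p = stats.f_oneway(memory, cognitive, metacognitive)
print(f"One-way ANOVA across categories: F = {f:.2f}, p = {p:.4f}")

X = np.column_stack([memory, cognitive, metacognitive])
high_gpa = (cognitive + rng.normal(0, 0.8, n) > 3.4).astype(int)  # toy outcome
model = LogisticRegression().fit(X, high_gpa)
print("Logistic regression coefficients (memory, cognitive, metacognitive):",
      np.round(model.coef_[0], 2))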
Article
To date, the journal impact factor (JIF), owned by Thomson Reuters (now Clarivate Analytics), has been the dominant metric in scholarly publishing. Hated or loved, the JIF dominated academic publishing for the better part of six decades. However, the rise of unscholarly journals, academic corruption, and fraud has been accompanied by a parallel universe of competing metrics, some of which might also be predatory, misleading, or fraudulent, while still others may in fact be valid. On December 8, 2016, Elsevier B.V. launched a direct competitor to the JIF, CiteScore (CS). This short communication explores the similarities and differences between JIF and CS. It also explores what this seismic shift in metrics culture might imply for journal readership and authorship.
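As a simplified sketch of the mechanical difference between the two metrics: both are citations-per-item ratios, but the JIF uses a two-year publication window with a one-year citation window, whereas CiteScore (in its revised form) uses a single four-year window for both. The functions below encode only this window logic with invented counts; they ignore the document-type and database-coverage rules that also distinguish the metrics.

# Simplified window logic behind the two metrics (invented counts only).
def impact_factor(citations_in_y_to_prev2, citable_items_prev2):
    """JIF(Y): citations in year Y to items published in Y-1 and Y-2,
    divided by citable items published in Y-1 and Y-2."""
    return citations_in_y_to_prev2 / citable_items_prev2

def citescore(citations_4yr_window, documents_4yr_window):
    """CiteScore(Y), revised methodology: citations in Y-3..Y to documents
    published in Y-3..Y, divided by the number of those documents."""
    return citations_4yr_window / documents_4yr_window

print(f"JIF       = {impact_factor(640, 200):.2f}")   # hypothetical journal
print(f"CiteScore = {citescore(2600, 450):.2f}")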
Article
Graphics are often mistaken for a mere frill in the methodological arsenal of data analysis when in fact they can be one of the simplest and at the same time most powerful methods of communicating statistical information (Tufte, 2001). The first section of the article argues for the statistical necessity of graphs, echoing and amplifying similar calls from Hudson (2015) and Larson–Hall & Plonsky (2015). The second section presents a historical survey of graphical use over the entire history of language acquisition journals, spanning a total of 192 years. This shows that a consensus for using certain types of graphics, which lack data credibility, has developed in the applied linguistics field, namely the bar plot and the line graph. The final section of the article is devoted to presenting various types of graphic alternatives to these two consensus graphics. Suggested graphics are data accountable and present all of the data, as well as a summary structure; such graphics include the scatterplot, beeswarm or pirate plot. It is argued that the use of such graphics attracts readers, helps researchers improve the way they understand and analyze their data, and builds credibility in numerical statistical analyses and conclusions that are drawn.
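The data-accountable alternatives recommended above are straightforward to produce with standard plotting libraries. The sketch below draws jittered individual scores with group means overlaid, using simulated data; proper beeswarm or pirate plots would need additional packages, so this only conveys the general "all the data plus a summary" idea.

# A minimal 'show all the data plus a summary' alternative to a bar plot.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
groups = {"Control": rng.normal(55, 12, 30), "Treatment": rng.normal(63, 12, 30)}  # simulated scores

fig, ax = plt.subplots(figsize=(4, 4))
for i, scores in enumerate(groups.values()):
    x = np.full_like(scores, i, dtype=float) + rng.uniform(-0.08, 0.08, scores.size)
    ax.plot(x, scores, "o", alpha=0.5)                        # every data point
    ax.hlines(scores.mean(), i - 0.2, i + 0.2, linewidth=3)   # group mean

ax.set_xticks(range(len(groups)))
ax.set_xticklabels(list(groups))
ax.set_ylabel("Test score")
fig.tight_layout()
plt.show()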
Article
This article reviews all (quasi)experimental studies appearing in the first 19 volumes (1997–2015) of Language Teaching Research (LTR). Specifically, it provides an overview of how statistical analyses were conducted in these studies and of how the analyses were reported. The overall conclusion is that there has been a tight adherence to traditional methods and practices, some of which are suboptimal. Accordingly, a number of improvements are recommended. Topics covered include the implications of small average sample sizes, the unsuitability of p values as indicators of replicability, statistical power and implications of low power, the non-robustness of the most commonly used significance tests, the benefits of reporting standardized effect sizes such as Cohen’s d, options regarding control of the familywise Type I error rate, analytic options in pretest–posttest designs, ‘meta-analytic thinking’ and its benefits, and the mistaken use of a significance test to show that treatment groups are equivalent at pretest. An online companion article elaborates on some of these topics plus a few additional ones and offers guidelines, recommendations, and additional background discussion for researchers intending to submit to LTR an article reporting a (quasi)experimental study.
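One of the recommendations summarized above, controlling the familywise Type I error rate, takes a single call once the raw p values are collected. The five raw p values below are hypothetical.

# Adjusting a family of p values to control the familywise Type I error rate.
from statsmodels.stats.multitest import multipletests

raw_p = [0.004, 0.021, 0.038, 0.049, 0.210]   # hypothetical family of five tests

for method in ("bonferroni", "holm"):
    reject, adjusted, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(f"{method:>10}: adjusted p = {[round(p, 3) for p in adjusted]}, reject = {list(reject)}")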
Article
It is common to summarize statistical comparisons by declarations of statistical significance or nonsignificance. Here we discuss one problem with such declarations, namely that changes in statistical significance are often not themselves statistically significant. By this, we are not merely making the commonplace observation that any particular threshold is arbitrary—for example, only a small change is required to move an estimate from a 5.1% significance level to 4.9%, thus moving it into statistical significance. Rather, we are pointing out that even large changes in significance levels can correspond to small, nonsignificant changes in the underlying quantities. The error we describe is conceptually different from other oft-cited problems—that statistical significance is not the same as practical importance, that dichotomization into significant and nonsignificant results encourages the dismissal of observed differences in favor of the usually less interesting null hypothesis of no difference, and that any particular threshold for declaring significance is arbitrary. We are troubled by all of these concerns and do not intend to minimize their importance. Rather, our goal is to bring attention to this additional error of interpretation. We illustrate with a theoretical example and two applied examples. The ubiquity of this statistical error leads us to suggest that students and practitioners be made more aware that the difference between “significant” and “not significant” is not itself statistically significant.
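A small worked example of the point above, with illustrative numbers rather than anything from the article: study A estimates an effect of 25 with standard error 10 (z = 2.5, conventionally "significant"), study B estimates 10 with the same standard error (z = 1.0, "not significant"), yet the difference between the two estimates, 15 with a standard error of about 14.1, is itself far from significant.

# The 'significant' vs. 'not significant' comparison worked through (illustrative numbers).
import math
from scipy import stats

def z_and_p(estimate, se):
    z = estimate / se
    return z, 2 * (1 - stats.norm.cdf(abs(z)))   # two-sided p

est_a, se_a = 25.0, 10.0   # study A: significant
est_b, se_b = 10.0, 10.0   # study B: not significant
diff, se_diff = est_a - est_b, math.sqrt(se_a**2 + se_b**2)   # assuming independent estimates

for label, est, se in (("A", est_a, se_a), ("B", est_b, se_b), ("A - B", diff, se_diff)):
    z, p = z_and_p(est, se)
    print(f"{label:>5}: estimate = {est:5.1f}, SE = {se:4.1f}, z = {z:4.2f}, p = {p:.3f}")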
Article
Traditions of statistical significance testing in second language (L2) quantitative research are strongly entrenched in how researchers design studies, select analyses, and interpret results. However, statistical significance tests using p values are commonly misinterpreted by researchers, reviewers, readers, and others, leading to confusion regarding the actual findings of primary studies and critical challenges for the accumulation of meaningful knowledge about language learning research. This paper outlines the basic challenges of accurately calculating and interpreting statistical significance tests, explores common examples of incorrect interpretations in L2 research, and proposes strategies for resolving these problems.
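One misinterpretation noted above, reading p as an index of replicability, can be illustrated with a short simulation: even when a true effect exists and the first study is "significant", an exact replication frequently is not. The effect size, group size, and number of simulated studies below are arbitrary.

# How often does an exact replication of a 'significant' two-group comparison
# also reach p < .05? Settings (d = 0.4, n = 30 per group) are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
d, n, reps = 0.4, 30, 5000

def one_study():
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(d, 1.0, n)
    return stats.ttest_ind(a, b).pvalue

original    = np.array([one_study() for _ in range(reps)])
replication = np.array([one_study() for _ in range(reps)])

sig_first = original < 0.05
print(f"Power at these settings: {sig_first.mean():.2f}")
print(f"P(replication significant | original significant): {(replication[sig_first] < 0.05).mean():.2f}")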
Article
This paper presents a set of guidelines for reporting on five types of quantitative data issues: (1) Descriptive statistics, (2) Effect sizes and confidence intervals, (3) Instrument reliability, (4) Visual displays of data, and (5) Raw data. Our recommendations are derived mainly from various professional sources related to L2 research but motivated by results from investigations into how well the field as a whole is following these guidelines for best methodological practices, and illustrated by L2 examples. Although recent surveys of L2 reporting practices have found that more researchers are including important data such as effect sizes, confidence intervals, reliability coefficients, research questions, a priori alpha levels, graphics, and so forth in their research reports, we call for further improvement so that research findings may build upon each other and lend themselves to meta‐analyses and a mindset that sees each research project in the context of a coherent whole.
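Two of the reporting elements listed above, an effect size and its confidence interval, can be computed directly from the raw group scores. The sketch below uses the common large-sample approximation for the standard error of Cohen's d; more exact intervals rely on the noncentral t distribution, and the scores are simulated.

# Cohen's d for two independent groups with an approximate 95% CI.
import numpy as np

def cohens_d_with_ci(group1, group2, z=1.96):
    n1, n2 = len(group1), len(group2)
    s_pooled = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                        (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
    d = (np.mean(group1) - np.mean(group2)) / s_pooled
    se = np.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))   # large-sample approximation
    return d, (d - z * se, d + z * se)

rng = np.random.default_rng(8)
treatment = rng.normal(65, 12, 40)   # simulated posttest scores
control   = rng.normal(58, 12, 40)

d, (lo, hi) = cohens_d_with_ci(treatment, control)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")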
Article
Adequate reporting of quantitative research about language learning involves careful consideration of the logic, rationale, and actions underlying both study designs and the ways in which data are analyzed. These guidelines, commissioned and vetted by the board of directors of Language Learning, outline the basic expectations for reporting of quantitative primary research with a specific focus on Method and Results sections. The guidelines are based on issues raised in: Norris, J. M., Ross, S., & Schoonen, R. (Eds.). (2015). Improving and extending quantitative reasoning in second language research. Currents in Language Learning, volume 2. Oxford, UK.