Language Teaching Research
2019, Vol. 23(6) 727–744
© The Author(s) 2018
Article reuse guidelines:
sagepub.com/journals-permissions
DOI: 10.1177/1362168818767191
journals.sagepub.com/home/ltr
The seven sins of L2 research:
A review of 30 journals’
statistical quality and their
CiteScore, SJR, SNIP, JCR
Impact Factors
Ali H. Al-Hoorie
Jubail Industrial College, Saudi Arabia
Joseph P. Vitta
Queen’s University Belfast, UK; Rikkyo University – College of Intercultural Communication, Japan
Abstract
This report presents a review of the statistical practices of 30 journals representative of the
second language field. A review of 150 articles showed a number of prevalent statistical violations
including incomplete reporting of reliability, validity, non-significant results, effect sizes, and
assumption checks as well as making inferences from descriptive statistics and failing to correct
for multiple comparisons. Scopus citation analysis metrics and whether a journal is SSCI-indexed
were predictors of journal statistical quality. No clear evidence was obtained to favor the newly
introduced CiteScore over SNIP or SJR. Implications of the results are discussed.
Keywords
citation analysis metrics, CiteScore, JCR Impact Factor, journal quality, quantitative research,
second language, SJR, SNIP
Corresponding author:
Joseph P. Vitta, Queen's University Belfast, University Road, Belfast BT7 1NN, UK.
Email: jvitta01@qub.ac.uk

I Introduction

Second language (L2) researchers have long been interested in improving the quantitative rigor in the field. In an early study, Brown (1990) pointed out the need to improve quantitative quality in the field, singling out the importance of using ANOVA as opposed to multiple t-tests. This recommendation may now be common knowledge to many
researchers, pointing to the fact that our field has made substantial progress in statistical
practices over the decades (see Plonsky, 2014). The field has now moved to relatively
more advanced topics, including the need for a priori power calculation and assumption
checking (e.g. Larson-Hall, 2016; Norris et al., 2015; Plonsky, 2015). The analysis of
quantitative practices in the field is currently an active area of research, as ‘there is no
controversy over the necessity of rigorous quantitative methods to advance the field of
SLA' (Plonsky, 2013, p. 656).
In addition to the goal of improving quantitative quality, there has also been an inter-
est in overall journal quality, both within the L2 field and in academia in general (see
Egbert, 2007; Garfield, 2006; Vitta & Al-Hoorie, 2017). Indexing has often been seen as
a key factor in journal evaluation, with Scopus and the Web of Science being the two
indexes of prestige (Chadegani et al., 2013; Guz & Rushchitsky, 2009). Within these two
catalogues, citation analysis metrics have been employed as efficient measurements of
overall journal quality for some time. Garfield (2006), for example, noted that the Web
of Science’s Impact Factor has been in use since the 1950s as the index’s citation analysis
metric. At the same time, indexing and citation analysis have not universally been
accepted as a definitive way to assess journal quality within a field (e.g. Brumback,
2009; Egbert, 2007). This is primarily due to the notion that citation quantifies only one
aspect in the overall evaluation of a journal and might miss other, perhaps equally impor-
tant, aspects of journal quality.
Based on similar considerations, Plonsky (2013) has suggested the need for investi-
gating the relationship between statistical practices and journal quality in the L2 field. To
date, such an investigation does not seem to have been performed. The current study
therefore aimed to address this gap. A total of 150 quantitative articles from 30 L2 jour-
nals were assessed for quantitative rigor. The relationship between each journal’s quan-
titative quality and a number of popular journal quality measurements relating to citation
analysis and indexing were then examined.
II Overview
1 Quantitative rigor
In recognition of the importance of quantitative knowledge in the second language (L2)
field, a number of researchers have recently investigated various aspects related to the
methodological and statistical competence of L2 researchers. Loewen and colleagues (2014),
for instance, found that only 30% of professors in the field report satisfaction with their
level of statistical training, while only 14% of doctoral students do so. Loewen and
colleagues (2017) subsequently extended this line of research by using independent
measures of quantitative competence, rather than simply relying on self-reported knowledge,
and also found significant gaps in the statistical literacy of L2 researchers. Furthermore,
quality of research design does not seem a priority for some scholars in the field when
evaluating the status of different journals (Egbert, 2007).
Inadequate quantitative knowledge is likely to be evident in the field's publications.
In one of the first empirical assessments of the methodological quality in published
L2 research, Plonsky and Gass (2011) investigated quantitative reports in the L2 interac-
tion tradition spanning a period of around 30 years. They observed that ‘weaknesses in
the aggregate findings appear to outnumber the strengths’ (p. 349) and speculated
whether this is in part due to inadequate training of researchers. Subsequently, Plonsky
(2013, 2014) investigated articles published in two major journals in the field (Language
Learning and Studies in Second Language Acquisition). In line with previous findings,
these studies identified a number of limitations prevalent in the field, such as lack of
control in experimental designs and incomplete reporting of results. In a more recent
study, Lindstromberg (2016) examined articles published in one journal (Language
Teaching Research) over a period of about 20 years. This study also found issues similar
to those obtained in previous studies, such as incomplete reporting and overemphasizing
significance testing over effect sizes.
2 Unpacking the ‘sins’ of quantitative research
A large number of topics fall under quantitative research, making a parsimonious clas-
sification of these topics no easy task (see Stevens, 2009; Tabachnick & Fidell, 2013). In
this article, we adopt an approach similar to that used by Brown (2004), where statistical
topics are classified into three broad areas: psychometrics, inferential testing, and
assumption checking.
Psychometrics – subsuming reliability and validity – has been of major concern to
past research into the quantitative rigor in the field. Larson-Hall and Plonsky (2015)
pointed out that the most commonly used reliability coefficients in the field are
Cronbach’s α (for internal consistency) and Cohen’s κ (interrater reliability). Norris
et al. (2015) echoed the call for researchers to address reliability while also arguing
that researchers provide ‘validity evidence’ (p. 472). Validity evidence is of equal
importance to reliability, though validity is trickier since it lacks a straightforward
numeric value that researchers can report. For this reason, Norris et al. (2015) sug-
gested the use of conceptual validity evidence, such as pilot studies and linking instru-
ments to past research. Empirical validity evidence can also be utilized, such as factor
analysis for construct validity (Tabachnick & Fidell, 2013). In L2 research, reliability
has usually been emphasized, while validity considerations have been somewhat over-
looked. For example, studies using the Strategy Inventory for Language Learning
(Oxford, 1990) often reported internal reliability (see Hong-Nam & Leavell, 2006;
Radwan, 2011), despite calls for the need to also consider validity evidence for this
instrument (Tseng, Dörnyei, & Schmitt, 2006).
When it comes to inferential testing, a number of issues have been pointed out in
previous methodological research. One of these issues is, obviously, the need to use
inferential statistics. Descriptive statistics alone can sometimes be informative (Larson-
Hall, 2017), but for most purposes it is not clear whether an observed trend is merely a
natural fluctuation that should not be overinterpreted. Inferential statistics, by definition,
permits the researcher to generalize the observed trend from the sample at hand to the
population. In their investigation of interaction research, Plonsky and Gass (2011) report
that around 6% of studies did not perform any inferential tests. Similarly, Plonsky (2013) reports in
a subsequent study that 12% of the studies in the sample did not use inferential testing.
A second issue concerning inferential testing has to do with complete reporting of the
results (Larson-Hall & Plonsky, 2015). A number of methodologists (e.g. Brown, 2004;
Nassaji, 2012; Norris et al., 2015; Plonsky, 2013; 2014, among others) have emphasized
that inferential tests must have all their relevant information presented for transparency
and replicability. In the case of t-tests, for example, readers would need information
regarding means, standard deviations, degrees of freedom, t-value, and p-value.
Effect sizes, which describe the magnitude of the relationship among variables, are a
further required aspect of quantitative rigor. In arguing for the centrality of effect sizes,
Norris (2015) asserted that L2 researchers have a tendency to conflate statistical signifi-
cance with practical significance. In the same vein, Nassaji (2012) posited that our field
has yet to firmly grasp that p-values only speak to Type I error probability and not the
strength of association between dependent and independent variables (see Cohen, Cohen,
West, & Aiken, 2003). Effect size reporting has therefore been stressed by L2 method-
ologists in recent years (e.g. Larson-Hall & Plonsky, 2015; Plonsky, 2013, 2014; Plonsky
& Gass, 2011; Plonsky & Oswald, 2014).
In addition to the above, a common situation in inferential testing is when researchers
perform several tests. Brown (1990) suggests that researchers should employ ANOVA as
an alternative to multiple t-tests in order to control Type I error rate. Norris and colleagues
(Norris, 2015; Norris et al., 2015) also highlight the need to correct the alpha
level for multiple comparisons. A common procedure is the Bonferroni correction, where the
alpha level is divided by the number of tests performed. As an illustration, if 10 t-tests are
performed simultaneously in one study, the alpha level becomes .05/10 = .005. In this
example, a result is considered significant only if the p-value is less than .005. Procedures
that are less conservative than the Bonferroni correction have also been proposed (e.g.
Holm–Bonferroni and Benjamini–Hochberg; see Larson-Hall, 2016; Ludbrook, 1998).
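The two corrections just mentioned can be sketched in a few lines. Assuming a set of hypothetical p-values, the classic Bonferroni cut-off and the less conservative Holm–Bonferroni step-down procedure look like this (function names are ours):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 only when p < alpha / m (m = number of tests)."""
    m = len(pvals)
    return [p < alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm-Bonferroni step-down: compare the i-th smallest p-value
    to alpha / (m - i); stop at the first non-rejection."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for step, i in enumerate(order):
        if pvals[i] < alpha / (m - step):
            reject[i] = True
        else:
            break
    return reject

# Five hypothetical simultaneous tests
pvals = [.001, .011, .020, .040, .200]
# Bonferroni rejects only the first; Holm also rejects the second,
# illustrating its lower conservatism.
```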
Finally, an essential consideration in quantitative research is checking that the necessary
assumptions are satisfied. Although it is typically classified under inferential statistics (i.e.
to determine whether parametric tests are appropriate), this point is placed in a separate
category in the present article for two reasons. First, checking assumptions is not limited to
inferential testing but also applies to descriptive statistics. For example, reporting the mean
and standard deviation assumes that the data are normally distributed. Otherwise, the mean
and standard deviation would not be representative of the central tendency and dispersion
of the data, respectively. Psychometric statistics also have assumptions that need to be satisfied. For
example, Cronbach’s alpha assumes that the construct is unidimensional (Green, Lissitz, &
Mulaik, 1977), or else its value could be inflated. Second, assumption checking seems
consistently overlooked in L2 research, despite repeated calls emphasizing its importance
(e.g. Lindstromberg, 2016; Loewen & Gass, 2009; Nassaji, 2012; Norris, 2015; Norris
et al., 2015; Plonsky, 2014; Plonsky & Gass, 2011). In the present study, the violations
reviewed above are called the seven ‘sins’ of quantitative research (see Table 1).
III Journal quality
Discussion of the methodological rigor of research articles ultimately speaks to the qual-
ity of the field’s journals. Assessment of journal quality is of particular importance
because of the value different stakeholders in both academic and professional arenas
place on research found in them (Weiner, 2001). Egbert (2007) surveyed multiple ways
to gauge journal quality in the L2 field, such as citation analysis, rejection rate, time to
publication, and expert opinion.
In reality, however, citation analysis has been one of the most common means of
evaluating different journals, probably because it offers a simple numeric impact value
to rank each journal (see Brumback, 2009; Colledge et al., 2010; Leydesdorff & Opthof,
2010). The history of citation analysis metrics dates back to the 1950s (see Garfield,
2006). Academia, in general, has been eager to embrace a means to gauge journal quality
empirically (Weiner, 2001). Currently, the most commonly used citation analysis metrics
are Source Normalized Impact per Paper (SNIP), SCImago Journal Rank (SJR), and
Journal Citation Reports (JCR) Impact Factor. The former two are maintained by
Elsevier’s Scopus, while JCR is maintained by Clarivate’s Web of Science (WOS; for-
merly Thompson-Reuters). WOS also maintains the Social Sciences Citation Index
(SSCI), which is the most relevant to our field.
Intense competition exists between these two indexing services, resulting in continuous
improvement of their metrics (see Chadegani et al., 2013; Guz & Rushchitsky, 2009). As
part of this development, Scopus has recently unveiled a new metric called CiteScore (see
da Silva & Memon, 2017), which is calculated in a similar way to JCR except that the look-
back period is three years rather than two. Table 2 presents an overview of these metrics.
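The look-back arithmetic just described can be sketched with hypothetical citation counts (real metrics also apply indexing rules about which items count as citable, which this simple ratio ignores):

```python
def impact_ratio(citations_this_year, citable_items_per_year):
    """Citations received in the metric year divided by the number of
    citable items published in the look-back window
    (3 years for CiteScore, 2 years for JCR)."""
    return citations_this_year / sum(citable_items_per_year)

# Hypothetical journal: 120 citations in the metric year to items
# published over the preceding three years (35, 40, and 45 items)
citescore_like = impact_ratio(120, [35, 40, 45])   # 3-year window -> 1.0
jcr_like = impact_ratio(90, [40, 45])              # 2-year window -> ~1.06
```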
Metrics judging journal quality through citation analysis have come under heavy criti-
cism (Brumback, 2009). Some have expressed doubt about the viability of using one
metric to assess the various dimensions contributing to a journal’s quality (Egbert, 2007).
Some of these metrics are proprietary and their actual calculations are not made public,
making them unverifiable. In fact, there have been reports of editors and publishers
negotiating and manipulating these metrics to improve journal rankings (see Brumback,
2009). Nevertheless, citation analysis metrics remain the primary means of assessing
journal quality, thus governing employment, tenure, funding, and livelihood for many
researchers around the world.
In the L2 field, there have been attempts to evaluate the different journals available.
In one study, Jung (2004) aimed to rank the prestige of a number of journals. Jung’s
primary criteria were being teacher-oriented and being indexed by the prestigious Centre
for Information on Language Teaching and Research, then hosted in the Language Teaching
journal (n = 12).

Table 1. The seven sins of quantitative research.

Psychometrics
  1. Not reporting reliability
  2. Not discussing validity
Inferential statistics
  3. Making inferences from descriptive statistics
  4. Incomplete reporting, including non-significant results
  5. Not reporting effect sizes
  6. Not adjusting for multiple comparisons
Other
  7. Not reporting assumption checks

In another study, Egbert (2007) created a list of field-specific journals
primarily based on expert opinion (n = 35). Benson, Chik, Gao, Huang, and Wang (2009)
also employed a list of journals (n = 10) for the purpose of evaluating these journals in
relation to qualitative rigor. In all of these studies, journals’ citation analysis values have
not been systematically taken into consideration in evaluating these journals.
IV The present study
As reviewed above, previous research on quantitative quality in the L2 field has tended
to focus on specific publication venues (e.g. one or two particular journals) or specific
research traditions (e.g. L2 interaction). Studies with a wider scope, on the other hand,
have not focused on either quantitative quality or its relation to popular journal citation
analysis. In this study, we aimed to conduct a broader review covering a representative
sample of L2 journals. This would allow us to obtain a more general overview of study
quality in the L2 field, as well as to empirically assess the utility of the four citation
analysis metrics as a representation of quality of different journals.
Because of this wide coverage, we also narrowed our scope to focus specifically on
statistical issues rather than more general methodological issues. For example, prior
research has repeatedly shown that many L2 researchers usually overlook important con-
siderations such as power analysis and using a control group (e.g. Lindstromberg, 2016;
Plonsky, 2013, 2014). However, it must be noted that such considerations are design
issues that need to be addressed before conducting the study. Thus, having an adequate
sample size or a control group can sometimes be governed by practical and logistical
considerations that are outside the researcher’s control. Because of the wide coverage of
our sample of journals, involving various L2 sub-disciplines where practical limitations
may be an essential part of everyday research, we limited our review to data analysis
issues specifically. We aimed to answer three research questions:
1. What are the most common statistical violations found in L2 journals?
2. What is the relationship between the journal's statistical quality and its citation analysis scores (SNIP, SJR, CiteScore, and JCR)?
3. What is the relationship between the journal's statistical quality and its indexing (SSCI vs. Scopus)?

Table 2. Four common journal citation analysis metrics and their characteristics.

SNIP (Scopus): Number of citations to a journal's articles in the past three years divided by the total number of its articles in the past three years. Normalized to facilitate cross-discipline comparisons (Colledge et al., 2010).
SJR (Scopus): Essentially the SNIP calculation, additionally weighted by the rank of the citing journal while excluding self-citations. The weighting uses the PageRank algorithm (Guerrero-Bote & Moya-Anegón, 2012).
CiteScore (Scopus): Total number of a journal's citations in a given year divided by the journal's total number of citable publications during the past three years (da Silva & Memon, 2017).
JCR (WOS): Total number of a journal's citations in a given year divided by the journal's total number of citable publications during the past two years (Garfield, 2006).

Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; WOS = Web of Science.
V Method
1 Inclusion criteria
In order to be included in this study, the journal had to satisfy the following criteria:
1. The journal is indexed by Scopus or SSCI.
2. The journal is within the second/foreign language learning and teaching area.
3. The journal presents original, quantitative research.
4. The journal uses English as its primary medium.
2 Journal search
The titles of journals indexed by Scopus and SSCI were examined against a list of
keywords representing various interests in the L2 field. Three experts were consulted
over two iterations to develop and validate the list of keywords (for the complete list,
see Appendix 1). A few well-known journals were not captured by our keywords (e.g.
System) and these were subsequently added. The final list of journals satisfying all
inclusion criteria included 30 journals (for the complete list, see Appendix 2). All jour-
nals were indexed in Scopus but only 19 were additionally indexed in SSCI. Our sam-
ple of 30 journals was larger than previous samples by Jung (2004, n = 12) and by
Benson et al. (2009, n = 10). It was slightly smaller than that by Egbert (2007, n = 35),
but this was primarily because our sample was limited to journals indexed in either
Scopus or SSCI. Therefore, it seems reasonable to argue that our sample is representa-
tive of journals in the L2 field.
Table 3 presents an overview of the journals in our sample and their citation analysis
scores. Both the Kolmogorov–Smirnov and Shapiro–Wilk tests showed no significant
deviation from normality in the four impact factors, ps > .05.
3 Data analysis
Five recent quantitative articles were randomly extracted from each of the 30 journals,
resulting in a total of 150 articles. All articles were published in 2016 or later, thus mak-
ing them representative of the latest quantitative trends in these journals. While a five-
article sample may not be fully representative of trends in a journal over time, our aim
was to investigate quantitative practices found in the most recent publications in the L2
field. This would allow us to find out the most common statistical violations in recent
literature. We discuss this issue further in the limitations of this study.
Each article was subsequently reviewed by two researchers independently against the
violations of statistical conventions described in Table 1. Controversial topics were
avoided, such as the adequacy of classical test theory vs. item response theory, or explor-
atory factor analysis vs. principal components analysis. In coding violations in each arti-
cle, repeated violations of the same issue were coded as one violation only, so violations
were coded in a binary format (present vs. not present) for each category. Interrater reli-
ability was high (96.7% agreement, κ = .91), and all discrepancies were resolved through
discussion until 100% agreement was reached.
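Cohen's κ, used for the interrater check above, is straightforward to compute from two coders' binary codes. The following is a sketch with hypothetical codes for ten articles on a single violation category (the data and function name are ours, not the study's):

```python
def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' binary codes (1 = violation present):
    kappa = (p_o - p_e) / (1 - p_e), with chance agreement p_e
    estimated from each coder's marginal rates."""
    n = len(coder_a)
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    p_a1 = sum(coder_a) / n          # coder A's rate of coding "present"
    p_b1 = sum(coder_b) / n          # coder B's rate of coding "present"
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes for 10 articles; the coders disagree on one
a = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
b = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(a, b)   # 90% raw agreement, kappa = .80
```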
VI Results
Violations were first averaged within each journal, and then the overall mean and stand-
ard deviation were computed (see Table 4). The table also reports the one-sample t-test
that examined whether the observed means were significantly different from zero. All
results were significant and lower than the Bonferroni-adjusted significance level (.05/7
= .007). The effect sizes were also generally substantial.
Since the maximum possible mean was 1.0 for each violation following our binary
coding, the means in the table can also be interpreted as a probability. As an illustration,
the probability of an article having issues with reliability is 24.7% (see Table 4). In other
words, almost one in every four articles would have a reliability issue. On the other hand,
almost every other article makes inferences from descriptive statistics (44%). Similarly,
just over one in three articles does not report effect sizes (38.7%).
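The probability reading of the binary-coded means can be made concrete by converting each rate into a "roughly one in N articles" figure (the rates are those reported in Table 4; the dictionary labels are ours):

```python
# Binary-coded violation means read as per-article probabilities,
# re-expressed as "roughly one in N articles" (rates from Table 4).
rates = {
    "reliability not reported": 0.247,
    "inferences from descriptive statistics": 0.440,
    "effect sizes not reported": 0.387,
}
one_in = {label: round(1 / rate, 1) for label, rate in rates.items()}
# e.g. a reliability issue appears in roughly one of every four articles
```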
Since the normality and linearity assumptions were satisfied, the correlations between
journals’ statistical quality and their citation analysis scores were examined to shed further
light on these results. There was a positive correlation between the statistical quality (vio-
lations were coded here so that a higher value indicated higher quality) of the 30 journals
and their Scopus citation analysis metrics (SJR: r = .414, p = .023; SNIP: r = .339, p = .067;
CiteScore: r = .344, p = .062). In other words, SJR accounted for 17.1% of the variance
observed in journals’ statistical quality while SNIP and CiteScore accounted for 11.5%
and 11.2% of the observed variance, respectively. JCR impact factor was non-significant
for the 19 journals indexed by SSCI, r = –.129, p = .599, accounting for a negligible 1.7%
of the variance.
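The variance-explained figures above are simply squared correlations expressed as percentages, e.g.:

```python
# Shared variance = r squared; reproducing two of the figures above.
r_sjr, r_snip = 0.414, 0.339
sjr_pct = round(r_sjr ** 2 * 100, 1)    # percent of variance explained
snip_pct = round(r_snip ** 2 * 100, 1)
# sjr_pct = 17.1, snip_pct = 11.5, matching the reported values
```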
Table 3. Number of journals included and means and standard deviations of their 2015 citation analysis scores.

Scopus (n = 30): SNIP 1.24 (0.78); SJR 0.98 (0.77); CiteScore 1.17 (0.84)
SSCI (n = 19): JCR 1.42 (0.60)

Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; SSCI = Social Sciences Citation Index.
Although the correlation between JCR and journal statistical quality was non-signifi-
cant, an independent samples t-test showed that non-SSCI-indexed journals had signifi-
cantly more violations (M = 11.27, SD = 3.35, n = 11) than SSCI-indexed ones (M = 7.89,
SD = 3.59, n = 19), t(28) = 2.54, p = .017, d = 0.97. These results suggest that SSCI-
indexed journals demonstrate higher quantitative rigor. Table 5 lists L2 journals indexed
by both Scopus and SSCI that demonstrated the fewest violations.1
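The independent-samples comparison above can be reproduced from the reported summary statistics alone; the sketch below (function name ours) recovers t and Cohen's d up to rounding:

```python
from math import sqrt

def pooled_d_and_t(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d (pooled SD) and the independent-samples t statistic
    computed from summary statistics."""
    sp = sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / sp
    t = (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
    return d, t

# Summary statistics reported above: violation counts for non-SSCI
# (M = 11.27, SD = 3.35, n = 11) vs. SSCI journals
# (M = 7.89, SD = 3.59, n = 19)
d, t = pooled_d_and_t(11.27, 3.35, 11, 7.89, 3.59, 19)
# t is about 2.54 on 28 degrees of freedom; d is about 0.96-0.97,
# small differences from the text reflecting rounding of the inputs
```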
VII Discussion
This article has presented the results of an analysis of 150 articles derived from a list of 30
journals representative of the L2 field. A number of statistical violations were observed
with varying degrees of frequency. The results also showed that Scopus citation analysis
metrics represent moderate predictors of statistical quality (accounting for around 11–
17% of the variance), with no evidence to favor the newly introduced CiteScore over
SNIP or SJR. Although these metrics account for less than 20% of the variance in the
observed quality of L2 journals, this magnitude might be considered reasonable, consider-
ing that statistical quality is only one dimension factoring in the overall quality of a jour-
nal. Other dimensions include non-statistical components of quantitative articles as well
as other types of articles such as qualitative and conceptual ones. Indeed, ‘a single method,
regardless of the number of components included, could not account for important differ-
ences among journals and in reasons for publishing in them’ (Egbert, 2007, p. 157).
The results also show that JCR was not a significant predictor of journal statistical
quality. This finding does not necessarily imply that this metric is not useful. It may mean
that L2 journals indexed by SSCI do not demonstrate sufficient variation for JCR to cap-
ture it. To become indexed by SSCI is typically a long process in which journals are
expected to demonstrate a certain level of quality. Indeed, the results above showed that
journals indexed in SSCI exhibited fewer statistical violations than non-SSCI journals.
Overall, these results suggest two ways to evaluate L2 journal quality: 1) the journal's SJR
value and 2) whether the journal is SSCI-indexed. The remainder of this article offers a
brief overview of the most common violations emerging from the present analysis.

Table 4. Prevalence of the seven violations emerging from the analysis.

Psychometrics
  1. Not reporting reliability: M = 0.247, SD = 0.23, t = 5.81, p < .0001, d = 1.06
  2. Not discussing validity: M = 0.087, SD = 0.13, t = 3.81, p = .00066, d = 0.70
Inferential statistics
  3. Making inferences from descriptive statistics: M = 0.440, SD = 0.29, t = 8.20, p < .0001, d = 1.50
  4. Incomplete reporting, including non-significant results: M = 0.253, SD = 0.23, t = 5.92, p < .0001, d = 1.08
  5. Not reporting effect sizes: M = 0.387, SD = 0.26, t = 8.25, p < .0001, d = 1.51
  6. Not adjusting for multiple comparisons: M = 0.140, SD = 0.20, t = 3.87, p = .00056, d = 0.71
Other
  7. Not reporting assumption checks: M = 0.253, SD = 0.25, t = 5.50, p < .0001, d = 1.00
1 Psychometrics
It is important for researchers to report details on the reliability and validity of their
instruments. In our analysis, a number of articles did not report the reliability of their
instruments, particularly dependent variables. In situations where manual coding is
involved, it is also important to examine and report interrater reliability as well. This
is especially important when a large amount of data is coded and subjectivity may be
involved.
When multiple scales are used (e.g. as part of a questionnaire), it is also important to
examine the factorial structure of the scales, whether using a procedure from classical
test theory or item response theory. It is common for researchers to adapt existing scales
for their own purposes but without conducting a factor analytic procedure to examine
convergent and discriminant validity of these scales. In some cases, even the original
developers of these scales did not investigate these issues. Some may argue that such
scales have been established in previous research. However, it would seem arbitrary to
require reporting reliability every time a scale is administered, but assume that other
psychometric characteristics can simply be imported from prior research. Reliability also
gives a very limited insight into the psychometric properties of a scale. Green et al.
(1977) offer examples of high reliability (e.g. over .80) that is a mere artifact of a long,
multidimensional scale, while Schmitt (1996) showed that low reliability (e.g. under .50)
is not necessarily problematic (see also Sijtsma, 2009, for a more detailed critique).
A discussion of validity is equally important. In our sample, a common situation
where a discussion of validity was lacking was in the use of authentic variables, such as
school grades. When the researcher draws from such authentic variables, it is typically
out of the researcher's hands to control for their reliability and validity. In such cases,
readers would at least need a detailed description of the variable, its characteristics, and
the circumstances surrounding its measurement in order to evaluate its adequacy for the
purpose of the article, such as whether grades are a fair reflection of proficiency in the
context in question. This information may also be helpful in resolving inconsistent results
when they emerge from different studies.

Table 5. Journals demonstrating highest quality in the present sample and their 2015 citation analysis scores (Scopus: SNIP, SJR, CiteScore; SSCI: JCR).

Computer Assisted Language Learning: SNIP 1.54, SJR 1.26, CiteScore 1.64, JCR 1.72
English for Specific Purposes: SNIP 2.73, SJR 1.66, CiteScore 2.11, JCR 1.14
International Review of Applied Linguistics in Language Teaching: SNIP 0.97, SJR 0.91, CiteScore 0.95, JCR 0.80
Language Assessment Quarterly: SNIP 0.63, SJR 1.07, CiteScore 0.93, JCR 0.98
Language Learning: SNIP 2.54, SJR 2.47, CiteScore 2.58, JCR 1.87
Language Testing: SNIP 1.36, SJR 1.44, CiteScore 1.50, JCR 0.91
Modern Language Journal: SNIP 1.13, SJR 1.15, CiteScore 1.54, JCR 1.19
Studies in Second Language Acquisition: SNIP 1.41, SJR 2.49, CiteScore 1.99, JCR 2.23
TESOL Quarterly: SNIP 1.43, SJR 1.46, CiteScore 1.58, JCR 1.51

Notes. JCR = Journal Citation Reports Impact Factor; SJR = SCImago Journal Rank; SNIP = Source Normalized Impact per Paper; SSCI = Social Sciences Citation Index.
When researchers develop their own instruments, extra work is required. Instrument
development should be treated as a crucial stage in a research project. Researchers devel-
oping a new instrument should perform adequate piloting to improve the psychometric
properties of the instrument before the actual study starts in order to convince the reader
of its outcomes. Poor instruments risk misleading results.
2 Inferential statistics
One of the most common issues arising in our analysis was the tendency to make infer-
ences from descriptive statistics. It is important to be aware of the distinction between
descriptive statistics and inferential statistics. Descriptive statistics refer to the character-
istics of the sample in hand. These characteristics could potentially be idiosyncratic to
this specific sample and not generalizable to the population it was sampled from.
Inferential statistics help decide whether these characteristics are generalizable to the
population, primarily through a trade-off between the magnitude of the descriptive sta-
tistic (e.g. mean difference between two groups) and the size of the sample.
Descriptive statistics alone may be useful in describing some general trends. However,
in most cases, without inferential statistics it may not be clear whether the pattern
observed is genuine or resulting from chance. This applies to all descriptive statistics,
such as means, standard deviations, percentages, frequencies, and counts. Researchers
reporting any of these statistics should consider an appropriate inferential test before
making generalizations to the population. In certain situations, it might at first seem hard
to think of an appropriate inferential test, but it is the researcher’s responsibility to dem-
onstrate to the reader that the results are generalizable to the population. Ideally, the
decision on which test to use should be made at the design stage before conducting the
study (e.g. preregistration).
In our sample, we found three common situations where inferences were frequently
made without inferential statistics. The first was when only one sample was involved. In
such situations, the researcher might consider a one-sample test to tell whether the statis-
tic is significantly different from zero (as was done in the present study). In some cases,
this might be a mundane procedure, but it is a first step toward calculating the size of the
effect (see below), which is typically a more interesting question. The second situation
arising from our analysis had to do with count data. A number of articles reported counts
of certain phenomena (e.g. number of certain linguistic features in a corpus), and then
made inferences based on those counts. In such cases, the researcher might consider the
chi-square test for independent groups and the McNemar test for paired groups. Rayson,
Berridge, and Francis (2004) have also suggested the log likelihood test for comparing
observed corpus frequencies. The third, more subtle, situation was when researchers
compared two test statistics, such as two correlation coefficients. In these situations, the
two coefficients might be different but the question is whether this difference is itself
large enough to be statistically significant. In fact, even if one coefficient were signifi-
cant and the other were not, this would not be sufficient to conclude that the difference
between them would also be significant (see Gelman & Stern, 2006). In this case, Fisher’s
r-to-z transformation could be used.
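The three situations above can be sketched in a few lines (hypothetical data throughout; scipy is assumed):

```python
import math
from scipy import stats

# (1) One sample: is the mean gain score different from zero?
gains = [0.8, 1.2, -0.3, 0.5, 0.9, 1.1, 0.2, 0.7]
t, p1 = stats.ttest_1samp(gains, popmean=0)

# (2) Count data: feature frequencies in two independent corpora.
#     Rows = corpus A/B; columns = feature present/absent.
table = [[45, 155], [78, 122]]
chi2, p2, dof, expected = stats.chi2_contingency(table)

# (3) Comparing two independent correlations via Fisher's r-to-z.
def fisher_z_test(r1, n1, r2, n2):
    z1, z2 = math.atanh(r1), math.atanh(r2)   # r-to-z transform
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    return z, 2 * stats.norm.sf(abs(z))       # two-tailed p

z, p3 = fisher_z_test(r1=0.60, n1=80, r2=0.35, n2=90)
print(f"(1) t = {t:.2f}, p = {p1:.3f}")
print(f"(2) chi2 = {chi2:.2f}, p = {p2:.3f}")
print(f"(3) z = {z:.2f}, p = {p3:.3f}")
```

In example (3), both correlations could be individually significant while their difference is not; the explicit z test is what licenses a claim about the difference.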
Another issue arising from our results is incomplete reporting of results, including
non-significant ones. A number of articles presented their results in detail if they were
significant, but the non-significant results were abridged and presented only in passing.
Regardless of the outcome of the significance test, the results should be reported in full,
including descriptive statistics, test statistics, degrees of freedom (where relevant), p-val-
ues, and effect sizes. Failing to report non-significant results can lead to publication bias,
in which only significant results become available to the research community (Rothstein,
Sutton, & Borenstein, 2005). Failing to report non-significant results in full may also
preclude the report from inclusion in future meta-analyses.
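A minimal sketch of reporting a result in full, whether or not it reaches significance (the scores are hypothetical; scipy is assumed):

```python
from statistics import mean, stdev
from scipy import stats

# Hypothetical scores for two independent groups.
control = [61, 64, 58, 70, 66, 59, 63, 68]
treatment = [65, 69, 60, 72, 71, 64, 66, 70]

t, p = stats.ttest_ind(treatment, control)  # equal variances assumed
df = len(control) + len(treatment) - 2

# Cohen's d from the pooled standard deviation.
n1, n2 = len(treatment), len(control)
pooled_sd = (((n1 - 1) * stdev(treatment) ** 2 +
              (n2 - 1) * stdev(control) ** 2) / (n1 + n2 - 2)) ** 0.5
d = (mean(treatment) - mean(control)) / pooled_sd

# Report everything, significant or not.
print(f"M1 = {mean(treatment):.2f}, SD1 = {stdev(treatment):.2f}; "
      f"M2 = {mean(control):.2f}, SD2 = {stdev(control):.2f}; "
      f"t({df}) = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```

With these invented numbers the test is non-significant even though the effect size is sizeable; reporting the full set of statistics is what allows such a result to enter a future meta-analysis.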
In our analysis, we did not consider it a violation if confidence intervals were not
reported. Although there have been recent calls in the L2 field to report confidence inter-
vals to address some limitations of significance testing (e.g. Lindstromberg, 2016; Nassaji,
2012; Norris, 2015), this issue is less straightforward than it might seem at first. This is due
to the controversy surrounding the interpretation of confidence intervals. Since they were
first introduced (Neyman, 1937), confidence intervals have never been intended to repre-
sent either the uncertainty around the result, its precision, or its likely values. Morey,
Hoekstra, Rouder, Lee, and Wagenmakers (2016) refer to such interpretations as the fallacy
of placing confidence in confidence intervals, which is prevalent among students and
researchers alike (Hoekstra, Morey, Rouder, & Wagenmakers, 2014). As a matter of fact,
confidence intervals of a parameter refer to the interval that, in repeated sampling, has on
average a fixed (e.g. 95%) probability of containing that parameter. Confidence intervals
therefore concern the probability in the long run, and may not be related to the results from
the original study. Some statisticians have even gone as far as to describe confidence inter-
vals as ‘scientifically useless’ (Bolstad, 2007, p. 228). Whether the reader would agree with
this evaluation or consider it rather extreme, our aim is to point out that confidence inter-
vals are far from the unanimously accepted panacea for significance testing ills.
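The long-run character of confidence intervals described above can be illustrated with a small simulation (standard-library Python; a known population standard deviation is assumed for simplicity):

```python
import random
from math import sqrt

random.seed(1)
n, sigma, mu = 30, 1.0, 0.0
n_sims = 20_000
covered = 0
for _ in range(n_sims):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    m = sum(sample) / n
    half = 1.96 * sigma / sqrt(n)   # known-sigma 95% interval
    if m - half <= mu <= m + half:
        covered += 1
print(f"coverage = {covered / n_sims:.3f}")  # close to 0.95 in the long run
```

The 95% figure describes the procedure across repeated samples; any single interval either contains the parameter or does not, which is the distinction the fallacy literature cited above turns on.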
In contrast to confidence intervals, there is far more agreement among methodologists
on the need for reporting effect sizes to complement significance testing. The discussion
so far has mentioned significance and significance testing several times, probably giving
the impression of the meaningfulness of this procedure. Actually, a significant result is a
relatively trivial outcome. Because the null hypothesis is always false (Cohen, 1994),
just increase your sample size if you want a significant result! A p-value is the probability of obtaining data at least as extreme as those observed, given the null hypothesis. When the result is significant at the .05 level, we can conclude only that the probability of obtaining a result this extreme by chance is below 5% if the null hypothesis is true. It does not mean that the effect is big or strong. At the same time,
a non-significant result does not represent evidence against the hypothesis, as it is pos-
sible that the study was underpowered. To obtain evidence against a hypothesis, the
researcher needs to calculate the power of the test and then the effect size to demonstrate
that there is no ‘non-negligible’ effect (see Cohen et al., 2003; Lakens, 2013), or alterna-
tively use Bayes factors (see Dienes, 2014).
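The point that significance can be bought with sample size alone can be made concrete: the same trivially small standardized effect is non-significant at one n and highly "significant" at another (illustrative numbers; scipy is assumed):

```python
from math import sqrt
from scipy import stats

def one_sample_p(d, n):
    """Two-tailed p for a one-sample t-test with observed Cohen's d."""
    t = d * sqrt(n)
    return 2 * stats.t.sf(abs(t), df=n - 1)

d = 0.05                         # a trivially small effect
print(one_sample_p(d, 100))      # far from significant
print(one_sample_p(d, 10_000))   # "significant", same tiny effect
```

The effect size, not the p-value, is what stays constant across the two lines, which is why methodologists insist on reporting it.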
The final point discussed in this section is not adjusting for multiple comparisons. As
mentioned above, a p-value gives the probability of obtaining the data if the null is true.
It therefore does not refer to the probability of the hypothesis itself. With multiple com-
parisons, the likelihood of obtaining a significant result by mere chance no longer
remains at 5%, thus raising the risk of Type I error. One way to address this problem is
to implement an appropriate correction procedure (Larson-Hall, 2016; Ludbrook, 1998).
Another approach is to determine the specific tests to be conducted beforehand. Any
other analyses conducted should then be labeled explicitly as exploratory, since their
results could potentially reflect Type I error. Perhaps the worst a researcher could do in
this regard is to conduct various tests and report only those that turn out significant.
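One widely used correction procedure, Holm's step-down Bonferroni method, can be sketched in a few lines (the p-values are hypothetical; standard-library Python):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/retain decision for each p-value (Holm's step-down)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject

# Hypothetical p-values from five pairwise comparisons.
ps = [0.004, 0.030, 0.020, 0.251, 0.012]
print(holm_bonferroni(ps))  # only the two smallest survive correction
```

Uncorrected, four of the five comparisons would be declared significant at .05; after correction only two are, which illustrates how easily uncorrected multiple testing inflates Type I error.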
3 Other issues
In our analysis, a number of issues emerged related to not reporting assumption checks
before conducting particular procedures. This was placed in a separate category because it is used here in a broad sense, referring to descriptive and inferential statistics as well as psychometrics. For example, some articles used nonparametric tests due to non-
normality of the data but also reported the mean and standard deviation to describe their
data. Using the mean and standard deviation assumes that the data are normal. Many
articles also used inferential tests that require certain assumptions, such as normality and
linearity, but without assuring the reader that these assumptions were satisfied. Other
articles that performed factor analysis did not report factor loadings fully or address the
implications of cross-loadings. Many of these concerns can be addressed by simply making the dataset publicly available.
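A minimal sketch of checking assumptions before choosing a test (hypothetical data; scipy is assumed):

```python
from scipy import stats

group_a = [72, 75, 71, 78, 74, 69, 77, 73, 76, 70]
group_b = [68, 64, 70, 66, 71, 63, 69, 65, 67, 72]

# Check normality in each group before reporting means/SDs
# or running a parametric test.
_, p_norm_a = stats.shapiro(group_a)
_, p_norm_b = stats.shapiro(group_b)

# Check homogeneity of variance before a pooled-variance t-test.
_, p_levene = stats.levene(group_a, group_b)

if min(p_norm_a, p_norm_b) > 0.05 and p_levene > 0.05:
    stat, p = stats.ttest_ind(group_a, group_b)
    test_used = "independent-samples t-test"
else:
    stat, p = stats.mannwhitneyu(group_a, group_b)
    test_used = "Mann-Whitney U"
print(test_used, round(p, 4))
```

Whichever branch is taken, the point is that the choice of test, and the descriptive statistics reported alongside it, should be justified to the reader rather than assumed.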
A particularly overlooked assumption is independence of errors. Many statistical procedures assume that errors are uncorrelated. For example, when learners come from distinct classes, learners from the same class will be more similar to each other than those from different classes. When this happens, the observations are no longer independent. As a consequence,
Type I error rate increases, such as when learners from one class in the group have higher
scores because of some unique feature of that class. In this case, the overall group mean
will be inflated because of one class only. The effect of violating this independence
assumption might be mild when there are only a few classes. But with more classes (e.g.
over 20), the effect could be more serious. One approach to deal with this situation is to
use classes as the unit of analysis by averaging the values for learners within each class.
Another approach is to use multilevel and mixed-effects modeling to model both higher
and lower units simultaneously (Hox, 2010).
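The first approach, taking class means as the unit of analysis, can be sketched as follows (hypothetical clustered scores; standard-library Python):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical learner scores, clustered by class.
scores = [
    ("class1", 71), ("class1", 74), ("class1", 90),  # one unusual learner
    ("class2", 65), ("class2", 68), ("class2", 66),
    ("class3", 70), ("class3", 73), ("class3", 72),
]

# Aggregate to class means so the class, not the learner,
# becomes the unit of analysis.
by_class = defaultdict(list)
for cls, score in scores:
    by_class[cls].append(score)
class_means = {cls: mean(vals) for cls, vals in by_class.items()}
print(class_means)
```

The fuller alternative noted above is multilevel or mixed-effects modeling (e.g. lme4 in R or MixedLM in statsmodels), which retains learner-level information instead of discarding it through aggregation.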
VIII Conclusions
This article has presented a review of 30 journals representative of the L2 field. The review
focused on statistical issues specifically, rather than methodological issues. It was not
intended to downplay the importance of methodological issues, such as an adequate sample
size based on a priori power calculation or including a control group in (quasi)experimental
designs. Instead, we focused on statistical issues only because they seemed relevant to a
broader section of the field, including areas fraught with practical constraints.
The results showed that Scopus’s citation analysis metrics function as moderate pre-
dictors of L2 journal quality, accounting for around 11–17% of the observed variance in
journals’ statistical quality, thus providing no evidence in favor of the newly introduced
CiteScore over the other metrics, at least in our field. Another indicator of journal quality
is whether the journal is SSCI-indexed. SSCI’s JCR was not a significant predictor of
journal quality, probably due to the small variation among SSCI-indexed journals, most of which showed high quality in the L2 field. The analysis also revealed a number of
prevalent statistical violations that were surveyed in this article. Future research should
investigate other aspects of journal quality (i.e. other than statistical) to examine their
relationship with journal indexing and citation analysis metrics.
The present study is not without limitations. Our sample of 30 journals was rather
small. However, we were limited by the available number of L2 journals that are indexed
by Scopus and SSCI. For this reason, we did not have the luxury of conducting a power analysis and then obtaining a sufficiently large sample. Nevertheless, our study is still one
of the largest quantitative surveys of L2 journals in the field to date. A further limitation concerns whether five articles from a journal can be truly representative of that journal. In our case, in addition to aiming to investigate the most recent quantitative trends in journals, we were also bound by practical constraints. A total of 150 journal articles to read and analyse is no easy feat. Rather than insisting that future researchers use a larger sample than ours, an alternative approach is to conduct comparable studies on more recent literature and then combine the results meta-analytically. This
would help build a cumulative science of journal quality in the field.
Funding
This research received no specific grant from any funding agency in the public, commercial, or
not-for-profit sectors.
Note
1. We do not claim that other journals not listed in Table 5 necessarily have lower quality
because our sample was not exhaustive of journals in the field and because it included only
five recent articles from each journal. In fact, even for journals listed in Table 5, we do not recommend that researchers interested in improving their statistical literacy browse older issues of these journals, since quality (and editorial policies) change over time.
ORCID iDs
Ali H. Al-Hoorie https://orcid.org/0000-0003-3810-5978
Joseph P. Vitta https://orcid.org/0000-0002-5711-969X
References
Benson, P., Chik, A., Gao, X., Huang, J., & Wang, W. (2009). Qualitative research in language
teaching and learning journals, 1997–2006. The Modern Language Journal, 93, 79–90.
Bolstad, W.M. (2007). Introduction to Bayesian statistics. 2nd edition. Hoboken, NJ: Wiley.
Brown, J.D. (1990). The use of multiple t tests in language research. TESOL Quarterly, 24, 770–773.
Brown, J.D. (2004). Resources on quantitative/statistical research for applied linguists. Second
Language Research, 20, 372–393.
Brumback, R.A. (2009). Impact factor wars: Episode V: The empire strikes back. Journal of Child
Neurology, 24, 260–262.
Chadegani, A.A., Salehi, H., Yunus, M.M., et al. (2013). A comparison between two main academic literature collections: Web of Science and Scopus databases. Asian Social Science, 9, 18–26.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997–1003.
Cohen, J., Cohen, P., West, S.G., & Aiken, L.S. (2003). Applied multiple regression/correlation
analysis for the behavioral sciences. 3rd edition. Mahwah, NJ: Lawrence Erlbaum.
Colledge, L., de Moya-Anegón, F., Guerrero-Bote, V., et al. (2010). SJR and SNIP: Two new
journal metrics in Elsevier’s Scopus. Serials, 23, 215–221.
da Silva, J.A.T., & Memon, A.R. (2017). CiteScore: A cite for sore eyes, or a valuable, transparent
metric? Scientometrics, 111, 553–556.
Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in
Psychology, 5. Available online at http://doi.org/10.3389/fpsyg.2014.00781 (accessed March 2018).
Egbert, J.O.Y. (2007). Quality analysis of journals in TESOL and applied linguistics. TESOL
Quarterly, 41, 157–171.
Garfield, E. (2006). The history and meaning of the journal impact factor. JAMA, 295, 90–93.
Gelman, A., & Stern, H.S. (2006). The difference between ‘significant’ and ‘not significant’ is not
itself statistically significant. The American Statistician, 60, 328–331.
Green, S.B., Lissitz, R.W., & Mulaik, S.A. (1977). Limitations of coefficient alpha as an index of
test unidimensionality. Educational and Psychological Measurement, 37, 827–838.
Guerrero-Bote, V.P., & Moya-Anegón, F. (2012). A further step forward in measuring journals’
scientific prestige: The SJR2 indicator. Journal of Informetrics, 6, 674–688.
Guz, A.N., & Rushchitsky, J.J. (2009). Scopus: A system for the evaluation of scientific journals.
International Applied Mechanics, 45, 351–362.
Hoekstra, R., Morey, R.D., Rouder, J.N., & Wagenmakers, E.-J. (2014). Robust misinterpretation
of confidence intervals. Psychonomic Bulletin & Review, 21, 1157–1164.
Hong-Nam, K., & Leavell, A.G. (2006). Language learning strategy use of ESL students in an
intensive English learning context. System, 34, 399–415.
Hox, J.J. (2010). Multilevel analysis: Techniques and applications. 2nd edition. New York:
Routledge.
Jung, U.O.H. (2004). Paris in London revisited or the foreign language teacher’s top-most jour-
nals. System, 32, 357–361.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4. Available online at http://doi.org/10.3389/fpsyg.2013.00863 (accessed March 2018).
Larson-Hall, J. (2016). A guide to doing statistics in second language research using SPSS and R.
2nd edition. New York: Routledge.
Larson-Hall, J. (2017). Moving beyond the bar plot and the line graph to create informative and
attractive graphics. The Modern Language Journal, 101, 244–270.
Larson-Hall, J., & Plonsky, L. (2015). Reporting and interpreting quantitative research findings:
What gets reported and recommendations for the field. Language Learning, 65, 127–159.
Leydesdorff, L., & Opthof, T. (2010). Scopus’s source normalized impact per paper (SNIP) versus
a journal impact factor based on fractional counting of citations. Journal of the American
Society for Information Science and Technology, 61, 2365–2369.
Lindstromberg, S. (2016). Inferential statistics in Language Teaching Research: A review and
ways forward. Language Teaching Research, 20, 741–768.
Loewen, S., & Gass, S. (2009). The use of statistics in L2 acquisition research. Language Teaching,
42, 181–196.
Loewen, S., Crowther, D., Isbell, D., Lim, J., Maloney, J., & Tigchelaar, M. (2017). The statisti-
cal literacy of applied linguistics researchers. Unpublished paper presented at the American
Association for Applied Linguistics (AAAL), Portland, Oregon, USA.
Loewen, S., Lavolette, E., Spino, L.A., et al. (2014). Statistical literacy among applied linguists
and second language acquisition researchers. TESOL Quarterly, 48, 360–388.
Ludbrook, J. (1998). Multiple comparison procedures updated. Clinical and Experimental
Pharmacology and Physiology, 25, 1032–1037.
Morey, R.D., Hoekstra, R., Rouder, J.N., Lee, M.D., & Wagenmakers, E.-J. (2016). The fallacy
of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23, 103–123.
Nassaji, H. (2012). Statistical significance tests and result generalizability: Issues, misconceptions,
and a case for replication. In G.K. Porte (Ed.), Replication research in applied linguistics (pp.
92–115). Cambridge: Cambridge University Press.
Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236, 333–380.
Norris, J.M. (2015). Statistical significance testing in second language research: Basic problems
and suggestions for reform. Language Learning, 65, 97–126.
Norris, J.M., Plonsky, L., Ross, S.J., & Schoonen, R. (2015). Guidelines for reporting quantitative
methods and results in primary research. Language Learning, 65, 470–476.
Oxford, R.L. (1990). Language learning strategies: What every teacher should know. New York:
Newbury.
Plonsky, L. (2013). Study quality in SLA: An assessment of designs, analyses, and reporting prac-
tices in quantitative L2 research. Studies in Second Language Acquisition, 35, 655–687.
Plonsky, L. (2014). Study quality in quantitative L2 research (1990–2010): A methodological
synthesis and call for reform. The Modern Language Journal, 98, 450–470.
Plonsky, L. (Ed.) (2015). Advancing quantitative methods in second language research. New
York: Routledge.
Plonsky, L., & Gass, S. (2011). Quantitative research methods, study quality, and outcomes: The
case of interaction research. Language Learning, 61, 325–366.
Plonsky, L., & Oswald, F.L. (2014). How big is ‘big’? Interpreting effect sizes in L2 research.
Language Learning, 64, 878–912.
Radwan, A.A. (2011). Effects of L2 proficiency and gender on choice of language learning strate-
gies by university students majoring in English. The Asian EFL Journal, 13, 115–163.
Rayson, P., Berridge, D., & Francis, B. (2004). Extending the Cochran rule for the comparison
of word frequencies between corpora. Unpublished paper presented at the 7th International
Conference on Statistical analysis of textual data (JADT 2004), Louvain-la-Neuve, Belgium.
Rothstein, H.R., Sutton, A.J., & Borenstein, M. (Eds.). (2005). Publication bias in meta-analysis:
Prevention, assessment and adjustments. Chichester: Wiley.
Schmitt, N. (1996). Uses and abuses of coefficient alpha. Psychological Assessment, 8, 350–353.
Sijtsma, K. (2009). On the use, the misuse, and the very limited usefulness of Cronbach’s alpha.
Psychometrika, 74, 107–120.
Stevens, J. (2009). Applied multivariate statistics for the social sciences. 5th edition. New York:
Routledge.
Tabachnick, B.G., & Fidell, L.S. (2013). Using multivariate statistics. 6th edition. Boston, MA:
Pearson.
Tseng, W.-T., Dörnyei, Z., & Schmitt, N. (2006). A new approach to assessing strategic learning:
The case of self-regulation in vocabulary acquisition. Applied Linguistics, 27, 78–102.
Vitta, J.P., & Al-Hoorie, A.H. (2017). Scopus- and SSCI-indexed L2 journals: A list for the Asia
TEFL community. The Journal of Asia TEFL, 14, 784–792.
Weiner, G. (2001). The academic journal: Has it a future? Education Policy Analysis Archives, 9.
Available online at http://doi.org/10.14507/epaa.v9n9.2001 (accessed March 2018).
Author biographies
Ali H. Al-Hoorie is assistant professor at the English Language Institute, Jubail Industrial College,
Saudi Arabia. He completed his PhD degree at the University of Nottingham, UK, under the super-
vision of Professors Zoltán Dörnyei and Norbert Schmitt. He also holds an MA in Social Science
Data Analysis from Essex University, UK. His research interests include motivation theory,
research methodology, and complexity.
Joseph P. Vitta is active in TESOL/Applied Linguistics research with interests and publications in
lexis, curriculum design, research methods, and computer-assisted language learning. As an ELT
professional, he has over 12 years’ experience as both a program manager and language teacher.
Appendix 1
Keywords
CALL, computer assisted language learning, EAP, English for academic purposes, EFL, English
as a foreign language, ELL, English language learner, ELT, English language teaching, ESP,
English for specific purposes, FLA, Foreign language acquisition, foreign language, language
acquisition, language assessment, language testing, language classroom, language curriculum, lan-
guage education, language educator, language learning, language learner, language learners, lan-
guage proficiency, language teaching, language teacher, language teachers, second language,
SLA, second language acquisition, TEFL, teaching English as a foreign language, TESL, teaching
English as a second language, TESOL, teaching English to speakers of other languages, teaching
English.
Appendix 2
Journals
1. Applied Linguistics
2. Asian EFL Journal
3. Asian ESP Journal
4. CALL-EJ
5. Computer Assisted Language Learning
6. Electronic Journal of Foreign Language Teaching
7. ELT Journal
8. English for Specific Purposes
9. Foreign Language Annals
10. Indonesian Journal of AL
11. Innovation in Language Learning and Teaching
12. International Review of Applied Linguistics in Language Teaching
13. Iranian Journal of Language Teaching Research
14. Journal of Asia TEFL
15. Journal of English for Academic Purposes
16. Journal of Second Language Writing
17. Language Assessment Quarterly
18. Language Learning
19. Language Learning & Technology
20. Language Learning Journal
21. Language Teaching Research
22. Language Testing
23. Modern Language Journal
24. ReCALL
25. Second Language Research
26. Studies in Second Language Acquisition
27. System
28. Teaching English with Technology
29. TESOL Quarterly
30. JALT CALL Journal
... The results should also be reported in detail and should include descriptive statistics, such as standard deviations, pvalues, and effect sizes for all variables that are of potential relevance or interest (Norris et al., 2015). Meta-analyses on the statistical quality of studies suggest that this has not always been done appropriately and highlight the need for more attention to statistical rigor and detail in L2 research (Al-Hoorie & Vitta, 2019;Plonsky, 2013). Tullock and Ortega (2017) go one step further and urge quantitatively oriented SA researchers to coordinate their research efforts by consistently applying comparable measures and data collection techniques and by consistently controlling modulating variables. ...
Chapter
This chapter explores quantitative research in second language study abroad (SA) contexts. It sketches out common research objectives, methods, and design options without attempting to take stock of all possible approaches to quantitative study abroad research (SAR). Instead, it raises seven issues that quantitative SAR is confronted with, explores how to tackle these, and makes recommendations for future studies. Three of the recommendations made concern the needs to (1) conduct multi-year studies to increase studies’ sample size and statistical power, (2) investigate interrelationships between SA participants’ performance in specific skills-tests and overall language proficiency tests, and (3) complement or replace self-report data through naturalistic observation of social and communicative behavior using tools like the Electronically Activated Recorder.
Article
Full-text available
Questionnaires have been widely used in second language (L2) research. To examine the accuracy and trustworthiness of research that uses questionnaires, it is necessary to examine the validity of questionnaires before drawing conclusions or conducting further analysis based on the data collected. To determine the validity of questionnaires that have been investigated in previous L2 research, we adopted the argument-based validation framework to conduct a systematic review. Due to the extensive nature of the extant questionnaire-based research, only the most recent literature, that is, research in 2020, was included in this review. A total of 118 questionnaire-based L2 studies published in 2020 were identified, coded, and analyzed. The findings showed that the validity of the questionnaires in the studies was not satisfactory. In terms of the validity inferences for the questionnaires, we found that (1) the evaluation inference was not supported by psychometric evidence in 41.52% of the studies; (2) the generalization inference was not supported by statistical evidence in 44.07% of the studies; and (3) the explanation inference was not supported by any evidence in 65.25% of the studies, indicating the need for more rigorous validation procedures for questionnaire development and use in future research. We provide suggestions for the validation of questionnaires.
Article
Full-text available
BACKGROUND Accurate assessment of the quality of academic journals is of great significance. While Journal Impact Factor (JIF), calculated by Clarivate and based upon the Web of Science literature database, and CiteScore (CS), developed by Elseiver and based upon the Scopus database, have enjoyed high uptake worldwide, efforts continue towards creation of other scientometric indexes that will provide ever-greater qualitative insights into journal impact. Such efforts have yielded the newly-launched Journal Article Influence Index (JAII), which is based on the Reference Citation Analysis (RCA) database, an open multidisciplinary citation analysis database based on artificial intelligence technology. AIM To evaluate and summarize the similarities and differences between JAII and JIF/CS as journal evaluation indicators, and provide an intuitive method for visual representation of the related data. METHODS We searched the Journal Citation Reports to obtain the 2021 JIF list, downloaded the CS list updated in July on the Scopus website, and collected the comprehensive list of 2022 JAIIs from the RCA database (www.referencecitationanalysis.com). RESULTS Our research results revealed that by breaking through the time limit of mainstream journal evaluation methods, the JAII is able to perform well in data reliability, establishing its benefit as a complementary scientometric index to JIF and CS. CONCLUSION JAII provides comprehensive assessment of the quality and performance of journals.
Article
Full-text available
This article discusses the influence of historically disadvantaged background on the culture of reading in some primary school learners from a school district. Investigation for this article was administered through a qualitative research approach, assisting in attaining first-hand information directly from the participants, thereby generating nonnumerical data. Embedded in this qualitative investigation was a case study design. As qualitative research concentrates on acquiring a comprehended understanding of how individuals perceive lived experiences, the main purpose of entrenching a case study was to dig deep into the in-depth descriptions coupled with the personal experiences of the subjects. It draws from semi-structured interviews conducted with primary school language teachers. The interview schedule specifically designed for this inquiry contained open-ended question types. During interviews, recordings were made in their natural settings through interacting with each participant. Data coding and analysis were informed by the iterative approach. The main findings of this investigation indicate that (i) teaching reading remains one of the basic skills in learning but was (ii) compromised by the lack of reading material, stemming from the disadvantaged background of the studied schools. Also, though motivation by parents seems to yield good results, there seemed to be (iii) a lack of influence and intervention strategies regarding available resources in the learners’ homes. I argue that family background does correlate (have an impact on learner reading ability) with learners’ reading ability. I conclude and propose that teachers need to employ teaching and learning methods that accommodate various cultural notions learners bring to school, as this is likely to impart positively on their academic performance.
Article
Full-text available
Educators are increasingly implementing holistic approaches into the foreign language classroom, with one popular method being mindfulness. Mindfulness has been associated with better grades and well-being in higher education. However, no study has yet explored the language-specific relationship between mindfulness and second language acquisition (SLA). This study investigated the relationship between the mindfulness level of 269 Japanese undergraduates and their ability to learn new L3 words in German. The study further examined the extent to which different domains of mindfulness were related to the ability to learn new vocabulary, as well as the relationship between mindfulness and the different vocabulary test parts (receptive, productive, grammatical gender). Correlation and multiple regression analyses revealed that higher mindfulness was associated with better vocabulary retention. In particular, the mindfulness dimension “Acting with Awareness” was key for all connections. Notably, the L1 and L3 parts showed a different relationship with different mindfulness dimensions: “Non-judging” was crucial for the productive part and “Observing” was decisive for the receptive part. These findings indicate that mindfulness not only improves general academic performance, but also has a direct relationship with SLA. Finally, this suggests that mindfulness training for students may be a novel approach to facilitate vocabulary learning.
Preprint
Full-text available
Self-determination theory is one of the most established motivational theories both within second language learning and beyond. This theory has generated several mini-theories, namely: organismic integration theory, cognitive evaluation theory, basic psychological needs theory, goal contents theory, causality orientations theory, and relationships motivation theory. After providing an up-to-date account of these mini-theories, we present the results of a systematic review of empirical second language research into self-determination theory over a 30-year period (k = 111). Our analysis of studies in this report pool showed that some mini-theories were well-represented while others were underrepresented or absent from the literature. We also examined this report pool to note trends in research design, operationalization, measurement, and application of self-determination theory constructs. Based on our results, we highlight directions for future research in relation to theory and practice.
Article
This paper introduces an online application developed to assist with statistical data analysis. While currently limited in scope to t tests, correlation, and regression analysis, the application moves beyond standard implementation of these tests by augmentation with bootstrapped statistics and several exploratory and explanatory graphical plots. Measures of effect size and confidence intervals are also calculated. The application is designed to be easy to use, requiring only input of correctly formatted data for the analyses to be run. After providing a brief introduction and rationale for the application, issues related to quantitative data analysis underpinning its analysis are outlined. Guidance on the functionality of the application is also provided. Finally, some caveats regarding its usage are addressed. Whilst aimed primarily at researchers with limited experience in quantitative data analysis, it is hoped the application may also be useful for a broad range of users.
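The bootstrapped confidence intervals the abstract describes can be sketched in a few lines of Python. The percentile-bootstrap function and the toy study-hours/vocabulary data below are illustrative assumptions, not the application's actual code or data.

```python
import random
import statistics
from math import sqrt

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def bootstrap_ci(x, y, reps=2000, alpha=0.05, seed=1):
    """Percentile-bootstrap confidence interval for Pearson's r."""
    rng = random.Random(seed)
    n, rs = len(x), []
    while len(rs) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) > 1 and len(set(ys)) > 1:  # skip degenerate resamples
            rs.append(pearson_r(xs, ys))
    rs.sort()
    return rs[int(reps * alpha / 2)], rs[int(reps * (1 - alpha / 2)) - 1]

# Hypothetical scores: hours of study vs. vocabulary test result
hours = [1, 2, 2, 3, 4, 4, 5, 6, 7, 8]
vocab = [10, 12, 14, 15, 14, 18, 20, 19, 24, 26]
r = pearson_r(hours, vocab)
lo, hi = bootstrap_ci(hours, vocab)
print(f"r = {r:.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```

Reporting the interval alongside the point estimate, rather than a bare p value, is exactly the practice the application is meant to encourage.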
Article
Purpose This paper aims to analyse, through a bibliometric study, the academic literature that relates entrepreneurship to food. Design/methodology/approach A database of 1,300 papers published in the ISI Web of Science was generated. Bibliometric techniques allowed us to describe the evolution of the scientific literature, the most productive authors, institutions and countries, the most relevant sources and documents, trend topics and social structure. Findings The results illustrate an upward trend, more accentuated in the last four years, in papers relating entrepreneurship to the food industry. Originality/value This research is novel because, although numerous articles relate the food industry to entrepreneurship, no bibliometric articles analysing the scientific production that relates the two terms have been found in the literature.
Chapter
Methodological transparency constitutes a central tenet of the open science movement that is sweeping across many disciplines. Drawing on the burgeoning meta-science that has investigated methodological practices in applied linguistics, particularly in the area of second language learning and teaching, this chapter outlines key characteristics of methodological transparency, focusing on the reporting and availability of materials, data, coding, and analysis procedures. The chapter summarizes empirical evidence about some of the negative consequences of a lack of methodological transparency, such as how it severely weakens our capacity to understand, evaluate, and replicate research. Whilst noting a number of important challenges ahead, the chapter highlights key practices and infrastructure that are now available to researchers, institutions, funders, and editors to promote a more collaborative, sustainable, and replicable research effort.
Article
Background/Purpose: Research papers provide the path for the expansion of knowledge, allowing new information to emerge and preventing duplication of previous research effort. Researchers use papers to convey their findings to the rest of the world. However, obstacles such as journal processing fees, inefficient review procedures, lengthy processing times, and journal closures present a major hurdle to the research community. To overcome these issues, this study proposes a system called the Open Platform for Research Article Sharing (OPRAS), which would allow authors to publish their papers using a peer-to-peer (P2P) architecture. Objective: To understand the research article publication process and the issues it faces, in order to propose OPRAS, a new system that employs a peer-to-peer architecture for sharing research articles. Design/Methodology/Approach: Data from websites, research papers, and other sources are collected, analysed, and presented using ABCD analysis. Findings/Results: Research articles published in journals allow researchers to communicate their work to the world, but the problems above prevent many authors from publishing their findings, slowing the dissemination of new knowledge. After examining the steps and problems involved in publishing an article, this paper puts forward a method for the proposed OPRAS system based on the P2P architecture. Originality/Value: A new system is presented based on the relevance of research articles and an understanding of the problems in publishing research papers; it should attract the attention of the research community and lead to further improvements of the proposed technique. Paper Type: Research Paper.
Article
Full-text available
In many Asia TEFL contexts, emphasis in academic promotions and tenure is placed on journals listed in Scopus and the Social Sciences Citation Index (SSCI). However, these indexing services do not offer a subcategory specific to second/foreign language (L2) research and practice, apart from rather generic categories such as linguistics and education. To address this gap, this brief report attempts to construct a comprehensive list of L2 Scopus- and SSCI-indexed journals. In this article, we present this list and the two-stage process we followed to obtain it.
Article
Full-text available
This study investigates the use of language learning strategies by 128 students majoring in English at Sultan Qaboos University (SQU) in Oman. Using Oxford's (1990) Strategy Inventory for Language Learners (SILL), the study seeks to extend our current knowledge by examining the relationship between the use of language learning strategies (LLS) and gender and English proficiency, measured using three criteria: students' grade point average (GPA) in English courses, study duration in the English Department, and students' perceived self-rating. It is also a response to a call by Oxford to examine the relationship between LLSs and various factors in a variety of settings and cultural backgrounds (see Oxford, 1993). Results of a one-way analysis of variance (ANOVA) showed that the students used metacognitive strategies significantly more than any other category of strategies, with memory strategies ranking last on students' preference scale. Contrary to the findings of a number of studies (see e.g., Hong-Nam & Leavell, 2006), male students used more social strategies than female students, thus creating the only difference between the two groups in terms of their strategic preferences. Moreover, ANOVA results revealed that more proficient students used more cognitive, metacognitive and affective strategies than less proficient students. As for study duration, the results showed a curvilinear relationship between strategy use and study duration, where freshmen used more strategies followed by juniors, then seniors and sophomores, respectively. Analysis of the relationship between strategy use and self-rating revealed a sharp contrast between learners who are self-efficacious and those who are not, favoring the first group in basically every strategy category.
To find out which type of strategy predicted learners' L2 proficiency, a backward stepwise logistic regression analysis was performed on students' data, revealing that use of cognitive strategies was the only predictor that distinguished between students with high GPAs and those with low GPAs. The present study suggests that the EFL cultural setting may be a factor that determines the type of strategies preferred by learners. This might be specifically true since some of the results obtained in this study vary from results of studies conducted in other cultural contexts. Results of this study may be used to inform pedagogical choices at university and even pre-university levels.
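As a rough illustration of the kind of logistic regression reported above, a single-predictor logistic model can be fit by gradient descent. The strategy scores, GPA labels, and learning-rate settings below are invented for the sketch; they are not the study's data or its stepwise procedure.

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def fit_logistic(x, y, lr=0.1, epochs=5000):
    """Fit a one-predictor logistic regression by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            err = sigmoid(b0 + b1 * xi) - yi  # prediction error
            g0 += err
            g1 += err * xi
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Hypothetical data: mean cognitive-strategy score vs. high GPA (1) / low GPA (0)
strategy = [2.1, 2.4, 2.8, 3.0, 3.3, 3.5, 3.9, 4.2, 4.4, 4.7]
high_gpa = [0,   0,   0,   0,   1,   0,   1,   1,   1,   1]
b0, b1 = fit_logistic(strategy, high_gpa)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

A positive slope here corresponds to the abstract's finding that greater cognitive-strategy use distinguishes high-GPA from low-GPA students.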
Article
To date, the journal impact factor (JIF), owned by Thomson Reuters (now Clarivate Analytics), has been the dominant metric in scholarly publishing. Hated or loved, the JIF dominated academic publishing for the better part of six decades. However, the rise of unscholarly journals, academic corruption, and fraud has been accompanied by a parallel universe of competing metrics, some of which may themselves be predatory, misleading, or fraudulent, while others may in fact be valid. On December 8, 2016, Elsevier B.V. launched a direct competitor to the JIF, CiteScore (CS). This short communication explores the similarities and differences between JIF and CS. It also explores what this seismic shift in metrics culture might imply for journal readership and authorship.
Article
Graphics are often mistaken for a mere frill in the methodological arsenal of data analysis when in fact they can be one of the simplest and at the same time most powerful methods of communicating statistical information (Tufte, 2001). The first section of the article argues for the statistical necessity of graphs, echoing and amplifying similar calls from Hudson (2015) and Larson-Hall & Plonsky (2015). The second section presents a historical survey of graphical use over the entire history of language acquisition journals, spanning a total of 192 years. This shows that a consensus for using certain types of graphics, which lack data credibility, has developed in the applied linguistics field, namely the bar plot and the line graph. The final section of the article is devoted to presenting various types of graphic alternatives to these two consensus graphics. Suggested graphics are data accountable and present all of the data, as well as a summary structure; such graphics include the scatterplot, beeswarm or pirate plot. It is argued that the use of such graphics attracts readers, helps researchers improve the way they understand and analyze their data, and builds credibility in numerical statistical analyses and conclusions that are drawn.
Article
This article reviews all (quasi)experimental studies appearing in the first 19 volumes (1997–2015) of Language Teaching Research (LTR). Specifically, it provides an overview of how statistical analyses were conducted in these studies and of how the analyses were reported. The overall conclusion is that there has been a tight adherence to traditional methods and practices, some of which are suboptimal. Accordingly, a number of improvements are recommended. Topics covered include the implications of small average sample sizes, the unsuitability of p values as indicators of replicability, statistical power and implications of low power, the non-robustness of the most commonly used significance tests, the benefits of reporting standardized effect sizes such as Cohen’s d, options regarding control of the familywise Type I error rate, analytic options in pretest–posttest designs, ‘meta-analytic thinking’ and its benefits, and the mistaken use of a significance test to show that treatment groups are equivalent at pretest. An online companion article elaborates on some of these topics plus a few additional ones and offers guidelines, recommendations, and additional background discussion for researchers intending to submit to LTR an article reporting a (quasi)experimental study.
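One of the recommendations above, reporting standardized effect sizes such as Cohen's d, is straightforward to implement. The pooled-standard-deviation formula below is the standard one for two independent groups; the two groups of scores are invented for illustration.

```python
import statistics
from math import sqrt

def cohens_d(group1, group2):
    """Cohen's d with pooled standard deviation (two independent groups)."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    s1, s2 = statistics.stdev(group1), statistics.stdev(group2)
    pooled = sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled

# Invented post-test scores for a treatment and a control group
treatment = [78, 82, 85, 88, 90, 76, 84]
control = [70, 75, 72, 80, 74, 78, 71]
d = cohens_d(treatment, control)
print(f"Cohen's d = {d:.2f}")
```

Because d is expressed in pooled-standard-deviation units, it can be compared and aggregated across studies in exactly the 'meta-analytic thinking' the article advocates.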
Article
It is common to summarize statistical comparisons by declarations of statistical significance or nonsignificance. Here we discuss one problem with such declarations, namely that changes in statistical significance are often not themselves statistically significant. By this, we are not merely making the commonplace observation that any particular threshold is arbitrary; for example, only a small change is required to move an estimate from a 5.1% significance level to 4.9%, thus moving it into statistical significance. Rather, we are pointing out that even large changes in significance levels can correspond to small, nonsignificant changes in the underlying quantities. The error we describe is conceptually different from other oft-cited problems: that statistical significance is not the same as practical importance, that dichotomization into significant and nonsignificant results encourages the dismissal of observed differences in favor of the usually less interesting null hypothesis of no difference, and that any particular threshold for declaring significance is arbitrary. We are troubled by all of these concerns and do not intend to minimize their importance. Rather, our goal is to bring attention to this additional error of interpretation. We illustrate with a theoretical example and two applied examples. The ubiquity of this statistical error leads us to suggest that students and practitioners be made more aware that the difference between "significant" and "not significant" is not itself statistically significant.
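The argument is easy to verify numerically. The figures below (estimates of 25 and 10, each with standard error 10) are a standard textbook illustration of this point, not necessarily the article's own examples: one estimate is significant, the other is not, yet the difference between them is far from significant.

```python
from math import erf, sqrt

def p_value(est, se):
    """Two-sided p value for a normal z test of est / se against zero."""
    z = abs(est / se)
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

p_a = p_value(25, 10)  # z = 2.5, "significant"
p_b = p_value(10, 10)  # z = 1.0, "not significant"
# Difference between the two independent estimates: 15, SE = sqrt(10^2 + 10^2)
p_diff = p_value(25 - 10, sqrt(10 ** 2 + 10 ** 2))
print(f"p_A = {p_a:.3f}, p_B = {p_b:.3f}, p_difference = {p_diff:.3f}")
```

Study A clears the .05 threshold and Study B does not, yet the comparison between them (p ≈ .29) gives no warrant for claiming the studies disagree.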
Article
Traditions of statistical significance testing in second language (L2) quantitative research are strongly entrenched in how researchers design studies, select analyses, and interpret results. However, statistical significance tests using p values are commonly misinterpreted by researchers, reviewers, readers, and others, leading to confusion regarding the actual findings of primary studies and critical challenges for the accumulation of meaningful knowledge about language learning research. This paper outlines the basic challenges of accurately calculating and interpreting statistical significance tests, explores common examples of incorrect interpretations in L2 research, and proposes strategies for resolving these problems.
Article
This paper presents a set of guidelines for reporting on five types of quantitative data issues: (1) Descriptive statistics, (2) Effect sizes and confidence intervals, (3) Instrument reliability, (4) Visual displays of data, and (5) Raw data. Our recommendations are derived mainly from various professional sources related to L2 research but motivated by results from investigations into how well the field as a whole is following these guidelines for best methodological practices, and illustrated by L2 examples. Although recent surveys of L2 reporting practices have found that more researchers are including important data such as effect sizes, confidence intervals, reliability coefficients, research questions, a priori alpha levels, graphics, and so forth in their research reports, we call for further improvement so that research findings may build upon each other and lend themselves to meta‐analyses and a mindset that sees each research project in the context of a coherent whole.
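Of the reporting practices listed, instrument reliability is most often summarized with Cronbach's alpha. The implementation below follows the standard formula (items-variance over total-score variance); the Likert responses are invented for illustration.

```python
import statistics

def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of columns, one list of scores per item."""
    k = len(items)
    item_vars = sum(statistics.variance(item) for item in items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    total_var = statistics.variance(totals)
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Invented Likert data: 4 questionnaire items answered by 6 respondents
items = [
    [3, 4, 3, 5, 4, 2],
    [2, 4, 3, 5, 5, 2],
    [3, 5, 4, 4, 4, 1],
    [2, 4, 4, 5, 5, 2],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Reporting this coefficient for each instrument, as the guidelines recommend, lets later meta-analysts correct observed effects for measurement error.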
Article
Adequate reporting of quantitative research about language learning involves careful consideration of the logic, rationale, and actions underlying both study designs and the ways in which data are analyzed. These guidelines, commissioned and vetted by the board of directors of Language Learning, outline the basic expectations for reporting of quantitative primary research with a specific focus on Method and Results sections. The guidelines are based on issues raised in: Norris, J. M., Ross, S., & Schoonen, R. (Eds.). (2015). Improving and extending quantitative reasoning in second language research. Currents in Language Learning, volume 2. Oxford, UK.