
Learner Corpus Research Meets Chinese as a Second Language Acquisition: Achievements and Challenges

Abstract

The article sheds light on Chinese as a Second Language Learner Corpus Research, emphasising advances and gaps in this field. First, the paper describes the potential of learner corpora in the investigation of learner language. Second, it provides an overview of Chinese learner corpus-based research and reviews existing L2 Chinese learner corpora. The paper highlights the lack of L2 Chinese learner corpora collecting data from Italian learners. Therefore, it emphasises the importance of compiling corpora to conduct studies on the acquisition of L2 Chinese by learners whose L1s are other than English or Asian languages.
e-ISSN 2385-3042
Annali di Ca’ Foscari. Serie orientale
Vol. 58 – Giugno 2022
Peer review
Submitted 2022-02-21
Accepted 2022-04-07
Published 2022-06-30
Open access
© 2022 | Creative Commons Attribution 4.0 International Public License
Citation Iurato, A. (2022). “Learner Corpus Research Meets Chinese as a
Second Language Acquisition: Achievements and Challenges”. Annali di Ca’ Foscari.
Serie orientale, 58, [1-34].
Edizioni Ca’ Foscari
DOI 10.30687/AnnOr/2385-3042/2022/01/024
Alessia Iurato
Università Ca’ Foscari Venezia, Italia; Universität Bremen, Deutschland
Keywords Learner corpus research. Chinese as a second language acquisition.
Corpus linguistics. L2 Chinese learner corpora. Learner corpus construction.
Summary 1 Introduction. – 2 The Definition of ‘Learner Corpus’ and the Specificity of
Learner Corpus Data. – 3 Potentials and Benefits of Learner Corpora. – 4 Learner Corpus
Research and Second Language Acquisition: Have They Ever Met? – 5 Development and
Achievements of Learner Corpus Research. – 6 Chinese as a Second Language Learner
Corpus Research. – 7 L2 Chinese Learner Corpora. – 7.1 Review of L2 Chinese Learner
Corpora. – 7.2 L2 Chinese Learners’ Input Corpora. – 8 Ongoing Research in CSL Learner
Corpus Construction. – 9 Concluding Remarks.
Annali di Ca’ Foscari. Serie orientale e-ISSN 2385-3042
58, 2022, 1-34
1 Introduction
The area of linguistic enquiry known as ‘Learner Corpus Research’
(LCR) originated in the late 1980s and it has mushroomed in recent
years, since interest in computer learner corpora is growing fast,
as well as the recognition of their theoretical and practical value
(Granger 2002; 2012).
LCR has created an important link between the two previously
distinct fields of corpus linguistics and foreign/second language
research. LCR started as “an offshoot of corpus linguistics” (Granger,
Gilquin, Meunier 2015b, 1), a field of study that had revealed enormous
potential in investigating a wide range of native languages, but
that had never explored non-native varieties.
LCR applies the paradigm of corpus linguistics to a wide range of
(research) purposes in foreign/second language acquisition (Grang-
er 2002; 2012). It is for this reason that LCR and corpus linguistics
share a set of common features (Meunier 2021).
Over the last few decades, the field of linguistics has experienced
an important paradigm shift from the study of language as an abstract
mental representation to the study of language in its actual use
(Zhang, Tao 2018). Corpus linguistic analysis has facilitated and
supported this significant transition, as it makes it possible to
systematically study patterns of language use through the investigation
of large electronically stored and automatically processed collections
of language samples (Zhang, Tao 2018). Established as an officially
independent discipline in 1960, corpus linguistics began to expand in the
1990s, developing concurrently with the advancement of computational
technology (Brezina, McEnery 2021). Compared to other linguistic
approaches, corpus linguistics has attracted widespread interest among
scholars because it displays several strengths relating to the support
of computational tools. First, it bases linguistic analysis on naturally
occurring data rather than on abstract mental intuition; second, corpus
data are empirical, constituting an important resource for interpreting
patterns of language use in natural contexts; third, it draws on a large
collection of texts, making it easy to provide information about the
frequency of occurrence of linguistic features; finally, these large
collections of texts make it possible to compare different varieties of
a language or different languages (Brezina, McEnery 2021). As summarised
by McEnery et al. (2019), there are many advantages of using corpora,
such as the shareability of data sets, which promotes the re-use of data
in further research, and the greater scale of analysis, which allows
broader conclusions to be drawn.
The above-mentioned features lead corpus linguistics to establish
itself as an increasingly reliable research approach, which became
widely adopted in linguistic studies. It goes without saying that all
those strengths also apply to LCR.
The present article will not give an overview of corpus linguistics
research, as this is beyond its scope; sources for a more in-depth study
of the subject can be found in the literature reviews by Biber, Conrad
and Reppen (1998), Brezina and McEnery (2021), McCarthy and
O’Keeffe (2010b), McEnery and Hardie (2011), McEnery and Wilson
(1996), Sinclair (1991), Stubbs and Halbe (2012), Tognini Bonelli (2010).
Following the definition provided by Brezina and McEnery (2021,
11) of corpus linguistics as “an approach to the study of language that
uses computers to analyse large amounts of language data, both written
and spoken, [called] corpora”, learner corpus linguistics can be
defined as a linguistic methodology which is founded on the use of
electronic collections of learners’ data (written and spoken), which
we call ‘learner corpora’.
The rationale of this paper is to identify achievements in LCR
gained over the last thirty years, focusing on advances, challenges,
pedagogical implications, and future directions of Chinese as a
Second Language Learner Corpus Research (CSL LCR). In detail, the
first section of the paper will explore the development of LCR as an
independent field; LCR core issues, potentials, and limits will be
investigated as well. Furthermore, the study will address the question
of whether LCR and Second Language Acquisition (SLA), despite both
belonging to the broader field of L2 studies, have finally met. The
second section of the paper will highlight how, in spite of the sudden
and significant increase in the teaching of Chinese in Italy in the last
few years at all levels, from school to university, driven by learners’
widespread interest in this language, research on Chinese acquisition in
the Italian context is not very developed, partly because of little
general scientific interest in this field until a few years ago.
Moreover, the study will show that in the Italian scientific scenario,
although there has been an indisputable growth in studies on the
acquisition of Chinese, there is a lack of research applying the rigorous
LCR methodology. The paper will therefore discuss the need to compile L2
Chinese corpora which collect data from L1 Italian learners. This issue
will be addressed by presenting recent attempts in the scientific
community to combine the LCR methodology with research on L2 Chinese
acquisition by Italian-speaking learners.
2 The Definition of ‘Learner Corpus’ and the Specificity
of Learner Corpus Data
As a specific type of corpora, computer learner corpora are defined
by Granger (2002, VII) as “electronic collections of spoken or written
texts produced by foreign or second language learners in a variety of
language settings”. Analogously, Barlow (2005, 335) states that learner
corpora are “digital representation of the performance or output
[…] of language learners”. In Stewart, Bernardini and Aston’s words
(2004, 2), “learner corpora consist of writing or speech produced by
language learners, or of materials written for language learners”.
McEnery, Xiao and Tono (2006, 5) defined a corpus in corpus
linguistics as a
collection of machine-readable authentic texts (including
transcripts of spoken data) which is sampled to be representative of a
particular language or language variety.
To follow up on McEnery, Xiao and Tono’s (2006) definition, Meunier
(2021, 23) claimed that
a learner corpus is thus a specific type of corpus which […] can
broadly be defined as a collection of machine-readable texts
consisting in representative samples of the language written and/or
spoken by learners of an additional language.
As Gilquin (2015) summarises, the main characteristic that
distinguishes a learner corpus from any other corpus is that it represents
language as produced by foreign or second language learners. On the
other hand, what distinguishes it from the data used in previous
SLA studies is that it is representative of learner language use. These
two distinctive features of learner corpora have led researchers to
provide a more detailed definition of learner corpora as “systematic
collections of texts produced by language learners” (Nesselhauf 2004,
125), where ‘systematic’ in Nesselhauf’s words (2004, 127) means that
the texts included in the corpus were selected on the basis of a
number of – mostly external – criteria (e.g. learner level(s),
learners’ L1(s) [mother tongue(s)]) and that the selection is
representative and balanced.
Agreeing with this definition, Callies and Götz add that learner
corpora can be defined as “systematic collections of authentic,
continuous, and contextualized language use (spoken or written) by L2
learners stored in electronic format”, stressing that “language samples
should be representative of learners’ contextualized use” (Callies,
Götz 2015, 3). In this respect, Granger emphasises the two essential
criteria for learner corpus data, i.e. the length of language samples
and the context in which the language is produced:
the notion of ‘continuous text’ lies at the heart of corpushood. A
series of decontextualized words or sentences produced by learners,
while being bona fide learner production data, will never qualify
as learner corpus data. In addition, it is best to restrict the term
‘learner corpus’ to the most open-ended types of tasks, viz. those
tasks that allow learners to choose their own word ordering rath-
er than being requested to produce a particular word structure.
(Granger 2008, 261)
In other words, learner corpus data are intended to be produced in
open-ended tasks that allow learners to choose their own wording in
spoken or written composition (Callies 2015a).
When defining learner corpora, a thorny issue is stating their
degree of authenticity and naturalness, as
[l]earner corpora represent a (more or less) naturalistic kind of
data, collected with no (or very little) control over what learners
say or write. (Gilquin 2021, 133)
This statement reveals that, as generally reported in the literature of
LCR (Gilquin 2015; 2021; Granger 2002; 2012; Meunier 2021, among
others), it is inaccurate to refer to learner corpora as collections of
fully natural data. The concept of authenticity is indeed tricky in the
case of learner language (Granger 2002). Learner language occurring
in learner corpora is meant to be as authentic as possible (Meunier
2021); however, Granger’s (2008, 260) description of learner
corpora as “electronic collections of (near-)natural foreign or second
language learner texts assembled according to explicit design
criteria” suggests that corpora may include texts that are not naturally
occurring texts. So, the definition of learner corpora as “authentic”
collections of learners’ data (Callies, Götz 2015, 3) stresses that
the language produced by learners is meant to be considered merely
situationally and interactionally authentic in the context of the SLA
classroom (Callies, Götz 2015).
In this respect, Granger (2008, 259) points out that
the term near-natural is used to highlight the “need for data that
reflects as closely as possible ‘natural’ language use”.
So, it can be inferred that “[t]he content of a learner corpus will,
more often than not, be exactly those activities which are natural in
the context of a second language classroom” (Gilquin, Gries 2009,
7), such as role plays, speaking and reading activities, writing, etc.
As stated by Granger (2002, 8),
[i]n relation to learner corpora, the term ‘authentic’ […] covers dif-
ferent degrees of authenticity, ranging from ‘gathered from the
genuine communication of people going about their normal busi-
ness’ to ‘resulting from authentic classroom activity’.
It is in fact acknowledged that some degree of artificiality is always
involved in the language teaching context and that learner data are
therefore rarely fully natural.
From this it follows that studies collecting purely spontaneous
oral or written learners’ production in LCR are fairly rare, as
learners’ data cannot easily be spontaneously collected. This is because,
for learners (especially foreign language learners), the target
language fulfils only a limited number of functions, most of which are
restricted to the classroom context. (Gilquin 2015, 10)
In fact, when learners are engaged in one of the required activities,
such as writing a composition or role-playing with their classmates,
they focus on practising what they have learned to improve their
language proficiency, rather than simply conveying an unpremeditated
message. Consequently, data collected under these circumstances
cannot be considered the authentic linguistic output of “people going
about their normal business” (Gilquin 2015, 10), as is usually the
case with fully natural corpus data. So, although according to Ellis
(1994) learner corpus data fall within the more open-ended types of
SLA data as natural language use data, and although they are supposed
to be ‘authentic’, because they contain data gathered from
the genuine communications of people doing their regular business,
fully natural learner data is difficult to collect, especially in
foreign language settings which give learners few opportunities to
use the L2 in authentic everyday situations. (Granger 2012, 8)
Following this train of thought, Granger (2012) points out that it is
counterproductive to analyse naturalistic data because of their
drawbacks, such as: a) impossibility of exploring some specific language
features because of the scarcity of the data; b) lack of control of
certain factors external or internal to the learner that may affect
learners’ production; c) difficulty in the interpretation of the data.
In light of all the above, the traditional definition of what counts as
a learner corpus needs to be expanded and renewed (Tracy-Ventura,
Myles 2015). What is defined as a learner corpus has always been a
hot topic in the SLA and LCR communities, and in the past literature
there is no general agreement on its definition. However, the
SLA and LCR communities now unanimously agree that corpora can no
longer be understood as fully authentic collections of learners’ data just
because they contain natural language use data produced by learners
who use the L2 for authentic communication purposes.
Nowadays, both communities consider a learner corpus as a collection
of computerised continuous, spontaneous, contextualised, representative
(near-)natural (written or spoken) data produced by foreign
or L2 learners, and gathered through those activities which are
ordinarily carried out in the teaching and learning of second/foreign
languages.
3 Potentials and Benefits of Learner Corpora
Like other categories of corpora, learner corpora make it possible
to search through and analyse millions or even billions of words, a
task that would be almost unmanageable without adequate computational
technology. As highlighted by Leech (2011, 7),
[i]f asked what is the one benefit that corpora can provide and
that cannot be provided by other means, I would reply ‘information
about frequency’.
Leech highlights the great contribution of using corpora: they allow
us to obtain information about the frequencies of occurrence of
linguistic features and the contexts of their use. In the specific case of
learner corpora, we can observe the frequency and distribution of
linguistic features in learner language use as sampled in the
corpora. This perspective of analysis is undoubtedly unique to (learner)
corpus analysis (Brezina, McEnery 2021). Learner corpus linguistics,
similarly to corpus linguistics, is able to answer research questions
such as the following:
Is the linguistic feature of interest underrepresented or
frequently distributed in learner language use?
What is the frequency rate of the linguistic feature of interest
in learner language use?
What are the typical collocations of the word of interest?
What are the typical contexts of use in which the word of interest
generally occurs?
Learner corpora allow us to answer the above-mentioned questions
not only in terms of quantitative analysis, but also in terms of
qualitative analysis: this combination is the real strength of learner
corpus linguistics (Brezina, McEnery 2021). Corpora provide information
on how typical or unusual a linguistic feature is in learner language
as sampled by the corpus (quantitative approach), and they
simultaneously inform us about the context in which it occurs and, most of
the time, the language background and the metadata of the learner
(qualitative approach). In learner corpus linguistics, therefore, these
two approaches of analysis are complementary and strongly connected:
qualitative information is often a starting point for further
quantitative analysis, whereas, to have a deeper and more correct
interpretation of quantitative analysis, it is often necessary to re-analyse
the text from which the data were extracted and engage in qualita-
tive analysis (Brezina, McEnery 2021).
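The two kinds of information discussed in this section, frequency of occurrence and context of use, can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the token list is an invented, pre-segmented toy sample (real L2 Chinese data would first require word segmentation), and the function names relative_frequency and collocates are hypothetical and do not belong to any tool cited in this article.

```python
from collections import Counter


def relative_frequency(tokens, target, per=1_000_000):
    """Occurrences of `target` normalised per `per` tokens."""
    if not tokens:
        return 0.0
    return tokens.count(target) * per / len(tokens)


def collocates(tokens, target, window=2):
    """Count the words found within `window` tokens of each `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Every token in the window except the node word itself.
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


# Invented, already segmented learner-like sample (an assumption).
toy_tokens = "我 把 书 放 在 桌子 上 了 他 把 书 给 我 了".split()

freq = relative_frequency(toy_tokens, "把", per=100)   # occurrences per 100 tokens
neighbours = collocates(toy_tokens, "把", window=1)     # immediate neighbours of 把
```

The normalised figure answers the quantitative questions (how frequent is the feature?), while the collocate counts point back to the contexts that a qualitative reading would then examine.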
Although LCR has developed as a branch of corpus linguistics, we
can state that nowadays it is an independent field of study (Granger
2021b). In the past thirty years, a considerable body of literature
on LCR has been published, analysing a wide range of linguistic
issues, from lexical to grammatical topics (Zhang, Tao 2018). The
inauguration of the biennial International Conference of Learner Corpus
Research in 2011, the foundation of the Learner Corpus Association
in 2013, and the publication of the Handbook of Learner Corpus
Research in 2015 all attest to the increasing relevance of LCR as an
autonomous field of study and the worldwide growth of the LCR
research community.
4 Learner Corpus Research and Second Language
Acquisition: Have They Ever Met?
Although LCR and SLA studies both fall into the wider field of L2
studies, “it must be acknowledged that they are still essentially two
different worlds” (Granger 2021a, 243).
The analysis of learner data is not new in linguistics. Written and
spoken data have always been collected and investigated in SLA
studies (Granger, Gilquin, Meunier 2015b). However, for a long time, in
the field of SLA data were rather artificial, as they were collected by
means of highly controlled tasks. Therefore, data in SLA were not
regarded as a realistic reflection of learners’ oral communication
skills. Moreover, since data were always collected in small quantities,
they were lacking in representativeness and statistical reliability.
The need to overcome these theoretical and methodological gaps,
as well as the need to create more “learner-aware / learner-focus
pedagogical tools” (Granger, Gilquin, Meunier 2015b, 1), encouraged the
emergence of learner corpora.
Dierently from previous data collections analysed in SLA re-
search, working on electronic collections of L2 data has brought two
main advantages (Granger, Gilquin, Meunier 2015b). First, as these
data collections are usually very big and collect data from a large
number of participants, they are arguably more representative than
smaller data collections gathered from a smaller number of students.
Second, as the data are computerised, the analysis procedures are
faster, and the data can be analysed for dierent research purpos-
es. Part-of-speech (POS) taggers, for instance, assign each word in
the learner corpus a tag labelling its grammatical category, facili-
tating the study of learners’ use of specic grammatical categories,
such as adverbs or prepositions. As for the analysis of errors, it is
possible to add error annotation by means of specic software tools
which make it possible to identify errors labelled with the same tag in
a very large corpus in a short time.
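How such annotation speeds up retrieval can be sketched in a few lines of Python. The sketch rests on assumptions that are not taken from any corpus discussed here: the inline error markup `<err type="...">`, the tag labels WO and VT, the `word/TAG` format for POS-tagged text, and the sample sentences are all invented for illustration, and the function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical inline error markup: <err type="TAG">erroneous text</err>
ERR = re.compile(r'<err type="(?P<tag>[A-Z]+)">(?P<text>.*?)</err>')


def errors_by_tag(annotated_texts):
    """Group every annotated error by its tag, keeping the source text id."""
    found = {}
    for doc_id, text in annotated_texts.items():
        for match in ERR.finditer(text):
            found.setdefault(match.group("tag"), []).append(
                (doc_id, match.group("text"))
            )
    return found


def pos_counts(tagged_sentence):
    """Count POS tags in a sentence written as space-separated word/TAG pairs."""
    return Counter(tok.rsplit("/", 1)[1] for tok in tagged_sentence.split())


# Invented learner sentences (assumptions, not data from a real corpus).
annotated = {
    "s01": 'I <err type="WO">very like</err> this book .',
    "s02": 'She <err type="VT">go</err> home and <err type="WO">always is</err> tired .',
}

errors = errors_by_tag(annotated)
tags = pos_counts("she/PRON often/ADV goes/VERB to/ADP school/NOUN")
```

Once the tags are in place, collecting all word-order errors or all adverbs is a single lookup, however large the corpus.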
The original project behind the development of major learner cor-
pora was
to enrich existing corpus collections with learner varieties and
pass on the advances made in computerized corpus linguistics to
applied linguistics. (Le Bruyn, Paquot 2021b, 1)
This placed the roots of LCR outside the domain of theory-driven
SLA research.
However, the expectation that learner corpora would become
a relevant resource for theory-driven SLA research was not borne out
(Le Bruyn, Paquot 2021b). Early learner corpus research was met
with scepticism by the SLA community. Bell, Collins and Marsden
(2021, 235) attribute LCR’s scarce popularity within SLA research
to its “preoccupation for coding errors, L1 transfer errors, and
deviations from a target-like norm”, at a stage when SLA research had
already moved beyond descriptive generalisations.
Although LCR has been the target of fierce criticism from the field of
SLA since the 1990s, over time LCR has improved steadily,
refining its theories and techniques, and SLA has recognised
its developments (Granger 2012; 2021a; Meunier 2021). In the last
few years, LCR and SLA have started to interact; nonetheless, Myles
(2015; 2021) notes that there is no real systematic collaboration.
Granger (2012; 2021a) has often highlighted the usefulness of the LCR
methodological approach and software tools for SLA research, as well
as the importance of SLA theory for the analysis of learner corpora:
[i]t is now time that corpus linguists and SLA specialists work
more closely, since the few studies that have used LCR [method-
ology] to test an SLA hypothesis demonstrate the potential of a
more SLA-informed approach. (Granger 2012, 8)
As stressed by Tracy-Ventura and Myles (2015), corpora and corpus
linguistics techniques should be an integral part of the toolkit of
every SLA specialist devoted to the research and analysis of
second/foreign language development. There are at least two main reasons why
SLA experts should adopt and appreciate the use of corpus
linguistics techniques. First, most of the current SLA hypotheses were
based on research involving a small number of students. Therefore,
findings cannot be considered generalizable and statistically reliable.
Secondly, the use of electronic corpora could streamline and speed
up the research process (Myles 2015). Moreover, corpus tools could
allow SLA specialists to consult a vast amount of data for different
research purposes. For example, as reported by Tracy-Ventura and
Paquot (2021b), the use of ‘regex’1 would allow for the extraction of
linguistic patterns. Corpus tools can also be adopted to retrieve
lexical bundles and collocations, which are the starting point for the
study of formulaic language in learner language. The use of a
concordancer would make it possible to retrieve at once all instances of
linguistic items of interest from an annotated learner corpus, thus
reducing the duration of the research process. Furthermore, spoken learner
corpora could be a valuable resource for the study of oral language
development: transcriptions can be searched so that linguistic items
of interest can be rapidly located.
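The three retrieval techniques mentioned in this paragraph (regular expressions, concordancing, and lexical bundle extraction) can be sketched with the Python standard library alone. This is a minimal illustration, not the procedure of any tool cited here: the sample text is invented, the names PATTERN, kwic and bundles are hypothetical, and a lexical bundle is simplified to a contiguous word sequence that recurs at least min_freq times.

```python
import re
from collections import Counter

# Invented learner-like sample text (an assumption).
text = "on the other hand I think that on the other hand it is true"
tokens = text.split()

# 1. Regular expression: every three-word string of the form "the X hand".
PATTERN = re.compile(r"\bthe \w+ hand\b")
pattern_hits = PATTERN.findall(text)


def kwic(tokens, node, width=3):
    """Key Word In Context: (left context, node, right context) per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def bundles(tokens, n=3, min_freq=2):
    """Contiguous n-word sequences occurring at least `min_freq` times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_freq}


concordance = kwic(tokens, "hand", width=2)
recurrent = bundles(tokens, n=3, min_freq=2)
```

A concordancer in this sense simply aligns every hit with its immediate context, which is what allows all instances of an item to be inspected at once.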
LCR studies have been regularly criticised for being merely
descriptive (Granger 2021a; Myles 2021). It is a common tendency to
believe that studies in the field of learner corpus linguistics mainly
provide statistical analyses, without adequate interpretation
of the data. It must be emphasised that LCR actually
recognises and attaches significant importance to the interpretation of
findings. This theme was one of the most debated topics during the
last Graduate Student Conference in Learner Corpus Research 2021,
which took place in October 2021 at the Inland Norway University
of Applied Sciences. On that occasion, the President of the Learner
Corpus Association repeatedly stressed the need to restore the right
balance between statistical analysis and data interpretation (Granger
2021b). In fact, only by placing the methodological paradigm of
learner corpus linguistics at the service of acquisitional studies
will we be able to truly enhance the potential of this methodology.
Now, the main question is: in the reality of L2 studies, have LCR
and SLA really met? As stated by Granger (2021a), the convergence
has not yet been achieved. Myles (2015, 309) agrees and states that
second language researchers have been rather slow in taking ad-
vantage of learner corpora and their associated computerized
methodologies […], and LCR is not always fully informed by SLA
research.
The main limitation of LCR has been, and arguably still is,
the lack of theoretical interpretation in the analysis of data. The
reason is that it was mainly corpus linguists who started research
activity in the field of LCR (Granger 2012). This can be considered
positive, as they were able to adapt corpus linguistics techniques to the
analysis of learners’ data, by designing new corpora according to strict
criteria which have been revised to meet the needs of LCR. This process
required experience and corpus expertise. However, the downside
is that, as corpus specialists, and not as SLA specialists, they
neglected the interpretation of data based on SLA theories; therefore,
their research was mainly descriptive, relatively limited to the
illustration of the corpus data, and lacking in theoretical frameworks.
1 ‘Regex’ or ‘regexp’, short for ‘regular expression’, is a sequence of symbols and
characters expressing a string or pattern to be searched for within a longer piece of text.
Although there is still no real synergy between the two fields, they
clearly have a lot to learn from each other:
L2 studies would benefit greatly if SLA researchers were more
familiar with the research carried out in LCR, and vice versa,
resulting in more cross-referencing of each other’s work in their
respective publications. (Granger 2021a, 254)
SLA would provide the theoretical foundation, which is usually lacking
in LCR; LCR, on the other hand, would offer descriptions of learner
language use from a wide variety of L1 backgrounds at different
proficiency levels. It is important that future works move in
this direction, to the mutual benefit of the two fields.
5 Development and Achievements
of Learner Corpus Research
LCR emerged approximately 30 years ago thanks to the noteworthy
research work conducted by Sylviane Granger and her research team
at the Catholic University of Louvain, in Belgium (Zhang, Tao 2018;
Granger 2021b). In the late 1980s, she conceived the idea of the Cen-
tre for English Corpus Linguistics (CECL), which gave rise to the cre-
ation of learner and multinational corpora especially for pedagogi-
cal purposes. Since its foundation, the Centre has produced fourteen
corpora, some of which are amongst the largest of their type and col-
lected data from numerous countries (Gráf 2017).
Granger’s efforts also led to the creation of two new learner corpus
methodologies: Contrastive Interlanguage Analysis (CIA) (Granger
2002) and Computer-aided Error Analysis (CEA) (Granger 2002).
CIA is based on the combination of two types of comparison:
non-native speakers/non-native speakers (NNS/NNS) comparison and native
speakers/non-native speakers (NS/NNS) comparison. The former
consists in comparing learners’ data from different learner populations
and different L1 backgrounds; the latter involves the comparison of
learners’ data with native speakers’ data (Granger 2002). CEA, on
the other hand, consists in “devising a standardized system for error
tags and tagging all the errors in a learner corpus” (Granger 2002,
14). CEA is different from previous Error Analysis2 studies because it
is computer-aided, involves a higher degree of standardisation, and
because learner “errors are presented in the full context of the text,
alongside non-erroneous forms” (Granger 2002, 13).
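The logic of an NS/NNS comparison in CIA can be illustrated with a small Python sketch that contrasts the frequency of one item in a learner corpus and in a native reference corpus. Two caveats apply. First, the log-likelihood statistic used below is a measure commonly employed in corpus comparison; it is chosen here for illustration and is not prescribed by the description of CIA given above. Second, all counts and corpus sizes are invented, and the function names are hypothetical.

```python
import math


def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


def log_likelihood(a, b, c, d):
    """Log-likelihood for an item occurring a times in a corpus of c tokens
    and b times in a corpus of d tokens."""
    total = a + b
    e1 = c * total / (c + d)  # expected count in corpus 1
    e2 = d * total / (c + d)  # expected count in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def compare(learner_count, learner_size, native_count, native_size):
    """Label the learner figure as overuse/underuse relative to the reference."""
    learner_freq = per_million(learner_count, learner_size)
    native_freq = per_million(native_count, native_size)
    if learner_freq > native_freq:
        direction = "overuse"
    elif learner_freq < native_freq:
        direction = "underuse"
    else:
        direction = "equal"
    ll = log_likelihood(learner_count, native_count, learner_size, native_size)
    return direction, ll


# Invented figures: 30 hits in 1,000 learner tokens vs 10 in 1,000 native tokens.
direction, ll = compare(30, 1000, 10, 1000)
```

The same function applies unchanged to an NNS/NNS comparison, with two learner populations taking the place of the learner and native corpora.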
Unsurprisingly, the earliest data sets collected data from learners
of L2 English with different L1 language backgrounds (Granger 2012;
Zhang, Tao 2018; Gráf 2017). The first to be compiled, and one of the
most notable examples of ‘commercial learner corpora’, is the Longman
Learner Corpus (Gilquin 2015; Gillard, Gadsby 1998). Its compilation
was followed by the launch of the most renowned learner corpus:
the International Corpus of Learner English (ICLE) (Granger 2003). The
ICLE was created in the 1990s by Sylviane Granger and her associates
at the Catholic University of Louvain and collects written
productions by intermediate and advanced L2 English learners. As a result
of international cooperation work, the ICLE contains 3.7 million
words produced by over 3,000 learners of L2 English with sixteen
different L1 backgrounds. Moreover, in the ICLE more than 20 task and
learner variables are documented (Granger et al. 2009). The corpus
has been designed and compiled with the aim of developing analysis
of high-frequency linguistic phenomena at the morphological,
grammatical, lexical, and discourse levels (Granger 2003).
The Cambridge Learner Corpus (CLC) (Nicholls 2003) was compiled to support English
language teaching publishers in producing a wide range of learning tools, such as
dictionaries and course books. The NUS Corpus of Learner English (NUCLE) (Dahlmeier,
Ng, Wu 2013) was established for the annotation and evaluation of grammatical error
correction systems.
Following this lead, a considerable number of learner corpora have gradually been
developed for the analysis of other European languages. Originally, studies were
limited exclusively to the analysis of data from learners of English, given the role
of English as the major lingua franca of the world (Granger 2002; 2012). However, in
recent decades, an increasing number of L2s have been the subject of studies in the
field of LCR, which has therefore experienced an exponential growth. The Learner
Corpora Around the World database,3 managed by the Centre for English Corpus
Linguistics of the University of Louvain, currently collects 190 learner corpora, 100
(52.6%) representing L2 English, the rest focusing on other target languages (Italian,
Spanish, French, German, Korean, Finnish, Arabic, Portuguese, Russian, etc.).
Unfortunately, this database currently counts only two available L2 Chinese learner
corpora: the Jinan Chinese Learner Corpus (JCLC)4 (Wang, Malmasi, Huang 2015) and the
Spoken Chinese Corpus of Informal Interaction,5 which is not available for public use.
In addition to the two above-mentioned, a review of existing L2 Chinese learner corpora
that are not currently included in the database of Learner Corpora Around the World
will be provided later in this paper.
2 Error Analysis is a type of linguistic analysis that focuses on the errors appearing
in learner language. It consists of a comparison between the errors made in the target
language and that target language itself. Error analysis emphasises the significance
of learners’ errors in second language. It determines whether those errors are
systematic, and (if possible) explains what caused them. For a detailed discussion of
the topic, see Corder 1975; Ellis 1985; 1987; 1994; Lüdeling, Hirschmann 2015; Richards
1980; Wallace Robinett, Schachter 1983.
3 The Learner Corpora Around the World database is searchable at:
https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html.
6 Chinese as a Second Language Learner Corpus Research
Studies on corpus linguistics have developed considerably in Chi-
na in the last decades. This paper, however, will merely focus on the
development of Chinese as a Second Language Learner Corpus Re-
search (CSL LCR).6
In Chinese, ‘learner corpus’ is identified as xuéxízhě yǔliàokù 学习者语料库 (learner
corpus) or zhōngjièyǔ yǔliàokù 中介语语料库 (interlanguage corpus). Works in CSL LCR
started in the late 1990s and have flourished over the past fifteen years (Zhang, Tao
2018). The steady increase in the construction of corpora of L2 Chinese has led to a
parallel exponential growth in the acquisitional studies of L2 Chinese over the last
decade. Scholars in this field have produced a large body of studies exploring the
acquisition of L2 Chinese at different levels from different perspectives. Another
important indicator of the progress of this field is the establishment of the biennial
CSL corpus research conference series The International Symposium of Chinese
Interlanguage Corpora: Construction and Application (Zhang, Tao 2018; Xu 2019). The
symposium first convened in Beijing in 2012, and the related conference proceedings,
published by the Journal of Chinese Language Teachers Association, report findings on
CSL learner corpora construction and their application. This research tradition is
further strengthened by the International Conference on Corpora of Chinese Spoken
Interlanguage, which was first convened in 2015 (Zhang, Tao 2018). On both occasions,
scholars’ discussions focused on Chinese interlanguage corpora and related issues,
such as research on the construction of Chinese interlanguage corpora, corpus-based
research on the acquisition of Chinese sentence patterns and syntax, and corpus-based
research on the acquisition of Chinese characters and words.
4 The JCLC is searchable in the list of the Learner Corpora Around the World database
at: https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html.
5 The Spoken Chinese Corpus of Informal Interaction is searchable in the list of the
Learner Corpora Around the World database at:
https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html.
6 For overviews of Chinese corpus linguistics, see Basciano, Gatti, Morbiato 2020;
Feng 2006; McEnery, Xiao 2016; Xu 2015; Zhan et al. 2006.
7 L2 Chinese Learner Corpora
The past twenty years have seen an exponential boom in Chinese learner corpus-based
studies, due to China’s growing global influence and the resulting increase in Teaching
Chinese as a Second Language courses (Xu 2019). From this it follows that
the construction of Chinese interlanguage corpora has become
very popular in the wake of the augmented enrolment of interna-
tional learners of Chinese. (44)
The central subject of CSL LCR has been the description of learn-
er language, with a particular focus on learner errors (Zhang, Tao
2018). In fact,
[e]arly learner corpora, such as the L2 Chinese Interlanguage
Corpus and the HSK Dynamic Composition Corpus […], were on-
ly tagged for learner errors rather than language in its totality.
Therefore, early CSL LCR used error analysis as its primary ana-
lytical framework. (50)
The earliest research was exclusively focused on the description of the taxonomy of
errors; scholars limited their analysis to identifying the canonical taxonomies of
underuse, overuse, and misuse of a target linguistic feature. Thanks to the use of
large-scale learner corpora, they were also able to extract information on the
frequency rate of learner errors, and then they tried to provide interpretations of
findings based on quantitative analysis. However, as stressed by Tono (2003), it was
necessary to move from pure descriptive taxonomies to exploring motivations for
interlanguage errors. In order to provide an adequate picture of learner language use,
later research began studying both errors and correct usages of linguistic features.
Nowadays, as Zhang and Tao (2018, 50) claim, “[t]he norm of current CSL LCR is to look
at language in its totality”. Therefore, current research (Zhang 2010; Zhang 2014,
among others), for example, examines the acquisition of specific linguistic features
not only by looking at learner errors, but also by comparing the frequency of use of
these linguistic features by learners with different L1s. Findings from investigations
of patterns of acquisition and development, as well as of individual variation in
learner language use, over the last few years have helped researchers and teachers,
among other things, to understand learners’ abilities, challenges, and developmental
trajectories (Lu, Chen 2019).
7.1 Review of L2 Chinese Learner Corpora
As this is a growing field, it is increasingly difficult to keep up to date and fully
informed about the vast number of learner corpus projects around the world. A review
of existing L2 Chinese learner corpora is provided below, with the aim of being as
comprehensive as possible, including the major and the small-scale corpora found
during the research carried out so far.
The first Chinese learner corpus project arose entirely independently of the corpus
linguistics research carried out in Europe and the USA. It is the L2 Chinese
Interlanguage Corpus (Hànyǔ zhōngjièyǔ yǔliàokù xìtǒng 汉语中介语语料库系统), constructed
from 1993 to 1995 by the research team led by Chu Chengzhi and Chen Xiaohe (Chu, Chen
1993; Chu et al. 1995) at the Beijing Language Institute, now Beijing Language and
Culture University (BLCU). The corpus, as the first interlanguage Chinese data set,
includes 5,774 essays written by 1,365 CSL learners from 96 different countries
studying L2 Chinese at nine universities in China. As for the corpus size, it consists
of 3,528,988 Chinese characters. The written data are POS tagged, parsed, and
error-annotated. The data are also supplemented by rich ethnographic learners’
metadata, documenting learners’ sociolinguistic variables. Moreover, the corpus allows
users to search by character, single word, sentence, discourse levels, or by learner
metadata (Zhang, Tao 2018). Unfortunately, this corpus is not available for public use.
In the late 1990s and early 2000s, the aforementioned L2 Chinese Interlanguage Corpus
was followed up by “the hitherto most frequently cited L2 Chinese learner corpus” (Xu
2019, 45), i.e. the HSK7 Dynamic Composition Corpus (HSK dòngtài zuòwén yǔliàokù HSK
动态作文语料库) (Zhang 2003). The Corpus Version 1.0, launched in 2006, was compiled by
the International Research and Development Center for Chinese Education at BLCU. This
first version contained 10,740 compositions, including 4 million characters. Over the
years, the corpus has been expanded and renewed, and in 2008 the new Corpus Version 1.1
was launched. Xun Endong from the School of Information Science at BLCU provided the
data of version 1.1. The corpus included 11,569 essays with 4.24 million characters.
Since the technology of the Corpus Version 1.1 had become obsolete, the team has
recently worked on the construction of version 2.0, which is now available.8 The Corpus
Version 2.0 retained all the data of Version 1.1. The overall corpus design and the
development of the latest software system are currently the responsibility of the team
led by Zhang Baolin at BLCU, in collaboration with Beijing Weishu Technology Co., Ltd.
Among the new features of the Corpus Version 2.0 is the possibility of generating
graphs for statistical analyses. Moreover, users can also add and edit error
annotations in the corpus. As for learners’ data, the corpus collects essays written
by L2 Chinese learners who took the HSK Chinese language proficiency test in the period
between 1992 and 2005. The genre of the essays is mainly narrative or argumentative. As
reported by Zhang and Tao (2018), 88.81% of contributors to the corpus are from an
Asian region or country, and 64% of the data were collected from Korean and Japanese
learners. Learners’ metadata are added to each composition; they include information
on the learner’s profile (such as age, nationality, L1 background) and the results of
the reading test, the listening test, the written test, and the spoken test,
accompanied by the HSK total score and the certificate awarded. Error annotation is
also added to the corpus at the levels of punctuation, character, lexicon, grammar,
and discourse (Zhang, Tao 2018). The HSK Dynamic Composition Corpus (Version 2.0) is
online and freely available for public use; once registered, users can start their
research and check the scanned copies of learners’ original compositions.9
7 The HSK test (Hànyǔ Shuǐpíng Kǎoshì 汉语水平考试) is the Chinese language proficiency
test of Mainland China for non-native speakers such as foreign students and overseas
Chinese.
In addition to the team working at BLCU, which can be considered the leader in L2
Chinese learner corpus research (Xu 2019), other research teams in the field have
constructed their own distinctive L2 Chinese learner corpus projects. For example,
National Taiwan Normal University (NTNU) has developed three important learner
corpora: the Chinese Character Errors Corpus, the Chinese as a Second Language Spoken
Corpus, and the TOCFL Learner Corpus.
The Chinese Character Errors Corpus (CCEC) (Teng et al. 2007) is the first and
“arguably the earliest” (Xu 2019, 45) learner corpus collecting data solely to analyse
learner errors in the writing of traditional characters. In fact, only error
annotation is added to the corpus. The corpus collects data from 124 students at
beginner, intermediate, and advanced levels from 22 different countries and from 15
different L1 backgrounds. Since the main contributors to other existing L2 Chinese
learner corpora were Japanese and Korean learners, the compilers of the CCEC decided
to exclude learners whose L1s were Japanese and Korean (Zhang 2013). The error
annotation consists of tagging misspellings, which are categorised into nine codes.
The scanned version of the erroneous characters is stored and attached to the corpus.
As reported by Xu (2019, 46),
[t]he National Taiwan Normal University corpus […] might be disqualified as a corpus
because its size is too small, and only individual characters, rather than running
texts, were recorded in the database.
This corpus is not available for public use.
8 The HSK Dynamic Composition Corpus (Version 2.0) is searchable online at:
http://yuyanziyuan.blcu.edu.cn/en/info/1043/1501.htm.
9 Information about the HSK Dynamic Composition Corpus (Version 2.0) is available at:
http://yuyanziyuan.blcu.edu.cn/en/info/1043/1501.htm.
A similar corpus collecting misspelled Chinese characters, but in their simplified
form, is the Continuity Corpus of Chinese Interlanguage of Character-Error System
(Hanzi Pianwu Biaozhu de Hanyu Lianxuxing Zhongjieyu Yuliaoku
汉字偏误标注的汉语连续性中介语语料库) (Zhang 2017), which was developed at Sun Yat-sen
University in Guangzhou.10 It includes written texts, which were tokenised and POS
tagged. Misspelled characters are tagged and, similarly to the CCEC, the image files
of the original hand-written texts are stored alongside each entry in the corpus.
The Chinese as a Second Language Spoken Corpus collects data from 450 learners,
gathered from the standard Mandarin language proficiency test adopted in Taiwan,
namely the Test of Chinese as a Foreign Language (TOCFL). Learners are grouped into
basic and advanced proficiency levels, and their L1 backgrounds are English, Japanese
and Korean. In total, the corpus contains 450 tests with 773,000 characters (Zhang,
Tao 2018).11
The TOCFL Learner Corpus (Chang 2013) is the first learner corpus of traditional
Chinese annotating grammatical errors (Lee, Tseng, Chang 2018). It contains written
essays that students completed for the TOCFL test collected since 2016. The learners’
data are accompanied by rich metadata including information on their L1, their CEFR
level,12 and information relating to the text genre, text function, text length, and
score. Contributors to the learner corpus are from 42 different L1 backgrounds (mainly
Japanese, followed by English, Vietnamese, Korean and Indonesian) (Zhang, Tao 2018).
The corpus consists of 5,092 essays, with 1,740,000 characters and 1,140,000 words. A
total of 33,835 grammatical errors in 2,837 essays and their corresponding corrections
have been manually annotated by the team to analyse inappropriate linguistic usages
(Lee, Tseng, Chang 2018). Two different sets of error classifications were used
simultaneously to tag grammatical errors: the first set
denotes the coarse-grained surface differences, while the […] [second set] denote[s]
the fine-grained linguistic category. The coarse-grained error types originate from
comparing erroneous sentences with the correct usages. […] The fine-grained error
types focus on representing linguistic concepts. (Lee, Tseng, Chang 2018, 2299)
The annotation was carried out by Chinese native-speaking annotators specifically
trained to follow the team’s annotation guidelines for the error-tagging task. At the
coarse-grained level, four error types are identified (missing, redundant, incorrect
selection, and incorrect word ordering), whereas at the fine-grained linguistic
category level, 36 error types were catalogued (Lee, Tseng, Chang 2018). Once the data
were annotated, the team formatted them in four sections: essay, learner, text, and
mistake (Lee, Tseng, Chang 2018). The TOCFL Learner Corpus is publicly available
online to facilitate further research.13
Recently, some Chinese interlanguage projects have assumed the goal of compiling more
balanced learner corpora covering both spoken and written interlanguage. The Mandarin
Interlanguage Corpus (MIC) (Tsang, Yeung 2012) and the Guangwai-Lancaster Chinese
Learner Corpus (GWLCLC)14 are two cases in point.
The MIC (Tsang, Yeung 2012) is a small-scale learner corpus compiled at the University
of Hong Kong. The corpus collects written and spoken data from pre-intermediate to
intermediate Mandarin learners from different L1s. Both written and spoken production
were collected in the form of coursework and examinations, amounting to a total of
approximately 50,000 characters and 60 hours of oral output. The MIC tags the errors
at the character level to avoid “the situations where errors receive different
treatments by the research team and the user, and as a result do not turn up in the
research” (Tsang, Yeung 2012, 190). The aims of the MIC project are: identifying and
tracking both written and spoken language patterns from Mandarin learners of different
L1s; facilitating research comparing features among learners of different L1s and
possibly different proficiency levels; enhancing the development of teaching and
assessment materials for L2 Chinese teaching (Tsang, Yeung 2012).15
10 The Hanzi Pianwu Biaozhu de Hanyu Lianxuxing Zhongjieyu Yuliaoku corpus is available
at: https://languageresources.github.io/2018/06/24/述承_汉字偏误注的汉语连续性中介语语料库/.
11 The Chinese as a Second Language Corpus is available online at:
http://140.122.83.243/mp3c.
12 The Common European Framework of Reference for Languages (CEFR) is an international
standard for describing language ability. It describes language ability on a six-point
scale (A1, A2, B1, B2, C1, C2), from A1 for beginners, up to C2 for those who have
mastered a language. The TOCFL Learner Corpus groups learners into four different
proficiency levels (A2, B1, B2, C1) following the CEFR standards (Cheng 2013).
13 The TOCFL Learner Corpus is available online at:
http://nlp.ee.ncu.edu.tw/resource/tocfl.html.
14 Information about the GWLCLC can be found online at:
https://www.sketchengine.eu/guangwai-lancaster-chinese-learner-corpus/.
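Returning briefly to the TOCFL Learner Corpus: its four coarse-grained error types (missing, redundant, incorrect selection, incorrect word ordering) are said to originate from comparing erroneous sentences with their corrected versions. The sketch below shows one way such a comparison might be approximated with Python's standard `difflib` module. It is a simplified heuristic that assumes one error per sentence pair; it is not the annotation procedure of Lee, Tseng and Chang (2018), and the function name `coarse_error_type` and the example sentences are invented for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def coarse_error_type(erroneous, corrected):
    """Guess a coarse-grained error type from two token lists.

    Assumption (not from the source): each pair contains a single error.
    """
    if erroneous == corrected:
        return "no error"
    # Same tokens in a different order: treat as a word-ordering error
    if Counter(erroneous) == Counter(corrected):
        return "incorrect word ordering"
    matcher = SequenceMatcher(a=erroneous, b=corrected, autojunk=False)
    ops = {tag for tag, *_ in matcher.get_opcodes() if tag != "equal"}
    if ops == {"insert"}:  # the correction adds a token the learner omitted
        return "missing"
    if ops == {"delete"}:  # the correction drops a token the learner added
        return "redundant"
    return "incorrect selection"


# Invented example pairs (learner sentence, corrected sentence)
examples = {
    "missing": (["我", "昨天", "去", "学校"], ["我", "昨天", "去", "了", "学校"]),
    "redundant": (["我", "很", "是", "学生"], ["我", "是", "学生"]),
    "incorrect selection": (["我", "看", "音乐"], ["我", "听", "音乐"]),
    "incorrect word ordering": (
        ["我", "学习", "在", "图书馆"],
        ["我", "在", "图书馆", "学习"],
    ),
}
results = {label: coarse_error_type(*pair) for label, pair in examples.items()}
```

Sentences containing several errors, or errors that mix categories, would fall through to "incorrect selection" in this sketch, which is one reason why manual annotation by trained annotators remains necessary.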
The GWLCLC16 is a written and spoken L2 Chinese corpus collecting data produced by 886
learners from 80 different countries studying at Guangdong University of Foreign
Studies (GDUFS) in China. Learners are classified into beginner, intermediate, and
advanced proficiency levels, according to the HSK Chinese Proficiency Test score
standards. The GWLCLC was built by Hai Xu and his team at GDUFS, in collaboration with
Vaclav Brezina at Lancaster University. Originally, the project was initiated by
Richard Xiao, whose vision was to bring corpus linguistics to the analysis of L2
spoken and written Chinese. The funding for the corpus was obtained by Xiao, to whom
the corpus is also dedicated. The corpus currently consists of 1.2 million words. It
has both a spoken (621,900 tokens, 48%) and a written (672,328 tokens, 52%) part, and
it is fully error tagged. Metadata are also incorporated in the corpus. The data
(spoken and written) were collected from exams and tutorial sessions carried out at
GDUFS. The written texts consist of short pieces and essays on a given topic. Spoken
data comprise interactions typically between native and non-native speakers of Chinese
and involve one, two, or multiple speakers. The corpus is a balanced sample that can
be used to explore various theoretical and practical issues pertaining to the
acquisition of Chinese as a foreign language. The corpus is searchable online on
Sketch Engine.17
Two recently compiled large-scale written corpora are also worth mentioning, i.e. the
Jinan Chinese Learner Corpus (JCLC) (Wang, Malmasi, Huang 2015) and the Yet Another
Chinese Learner Corpus (YACLC) (Wang et al. 2021). The JCLC (Wang, Malmasi, Huang
2015) is a large-scale corpus of L2 Chinese written texts produced by learners at
beginner, intermediate and advanced levels from 59 different L1 backgrounds learning
Chinese at different universities in China. Some data are also collected from
universities outside China. The JCLC project, started in 2006, aims to create a
learner corpus similar to the ICLE. The JCLC is an ongoing project since new data
continue to be collected and added to the corpus. The corpus currently contains 5.91
million Chinese characters across 8,739 texts, which are completed with a rich set of
metadata. The corpus has not been annotated yet, but, as declared by Wang, Malmasi and
Huang (2015, 122), “[t]he inclusion of error annotations and manual corrections is
another potential avenue for future work”. The corpus is freely available to the
research community upon contact with the research team.18
15 The MIC is not available online; for the relevant publication, see Tsang, Yeung 2012.
16 The GWLCLC is available online at:
https://app.sketchengine.eu/#dashboard?corpname=preloaded%2Fguangwai.
17 Information about the GWLCLC is available online at:
https://app.sketchengine.eu/#dashboard?corpname=preloaded%2Fguangwai.
The YACLC (Wang et al. 2021) is a large-scale written learner corpus with
multidimensional annotation. It contains 32,124 sentences from 2,421 essays, provided
by around 50,000 CFL learners. Its main characteristic is the multidimensional
annotation: for each sentence, an annotator was allowed to provide a variety of
revisions. According to Wang et al. (2021), revisions include grammatical or fluency
correction. Grammatical correction consists in making the sentence conform to grammar,
whereas fluency correction consists in making the sentence fluent and native-sounding.
In total, 183 annotators were recruited and instructed to annotate the corpus, and
each sentence was analysed and annotated by ten annotators, who worked on a
crowdsourcing platform specifically designed for the annotation process. This corpus
is available online to “further enhance the studies on Chinese International Education
and Chinese automatic grammatical error correction” (Wang et al. 2021, 1).19
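An annotation scheme of this kind can be pictured as one learner sentence paired with several revisions, each labelled as a grammatical or a fluency correction. The following sketch is a hypothetical in-memory representation: the class names, field names, annotator identifiers and example sentences are assumptions made for illustration and do not reproduce the actual YACLC data format.

```python
from dataclasses import dataclass, field

DIMENSIONS = {"grammatical", "fluency"}


@dataclass
class Revision:
    annotator_id: str
    dimension: str  # "grammatical" or "fluency"
    text: str


@dataclass
class AnnotatedSentence:
    source: str
    revisions: list = field(default_factory=list)

    def add(self, annotator_id, dimension, text):
        """Attach one annotator's revision to the source sentence."""
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        self.revisions.append(Revision(annotator_id, dimension, text))

    def by_dimension(self, dimension):
        """Distinct revised texts for one dimension, in first-seen order."""
        seen = []
        for revision in self.revisions:
            if revision.dimension == dimension and revision.text not in seen:
                seen.append(revision.text)
        return seen


# Invented example: three annotators revise the same learner sentence
sentence = AnnotatedSentence("我昨天去学校了见朋友。")
sentence.add("a01", "grammatical", "我昨天去学校见了朋友。")
sentence.add("a02", "grammatical", "我昨天去学校见了朋友。")
sentence.add("a03", "fluency", "我昨天去学校和朋友见了面。")
```

Keeping every annotator's revision, rather than a single gold correction, is what would allow agreement between annotators to be measured and several acceptable corrections to be used as references.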
Let us now turn to an overview of the oral corpora produced to date. The Spontaneous
Chinese Learner Speech Corpus (Wu, Shih 2014) is a large-scale spoken learner corpus
developed at the University of Illinois at Urbana-Champaign. It consists of 185 audio
and video recordings, which were collected during Chinese speech training classes on a
weekly basis from 2004 to 2009. The speakers in this corpus include 11 Chinese
language teachers, 11 Korean-speaking learners, 23 English-speaking learners and 86
Chinese heritage learners. Participants were asked to complete two different oral
open-ended tasks, and each of the tasks was designed to fit in a 50-minute class. The
data were transcribed through a transcription website, where each speaker turn was
presented individually with a link to the audio-video files. The data “has been used
for perceptual ratings and acoustic analyses on oral fluency and foreign accent” (Wu,
Shih 2014, 124). According to Wu and Shih (2014), this spoken corpus is a prolific
resource with speech samples for various research topics.
Other smaller-scale corpora are also available online and allow users to analyse L2
acquisition from different perspectives. The Spoken Chinese Corpus of Informal
Interaction is a small-scale corpus created by Lin Lu at Massey University, in New
Zealand. The corpus collects spoken data from English-speaking intermediate and
advanced learners from New Zealand and Australia. The data are collected from informal
conversation and interaction between 14 learners of L2 Chinese and Chinese native
speakers. It is included in the list of Learner Corpora Around the World and is
available online.20
18 The Jinan Chinese Learner Corpus is accessible online, upon contact with the
research team, at: https://hwy.jnu.edu.cn/jclc.
19 The YACLC is available online at: http://cuge.baai.ac.cn/#/.
The COPA Corpus (Zhang 2009) collects speech recordings from 120 college students
learning Mandarin in Hong Kong. The data are gathered from conversation with Chinese
native speakers. The corpus is part of the SLABank collection,21 which is a component
of TalkBank22 dedicated to providing corpora for the study of second language
acquisition and learning. The corpus is available for online browsing and download
via TalkBank.23
The HKPU Corpus is likewise part of the SLABank collection and is available via
TalkBank.24 It is a small-scale corpus and contains speech recordings of 20 college
students learning Mandarin in Hong Kong. The tasks involve oral interviews.
The Chinese Subcorpus of LINDSEI25 is a spoken learner corpus developed at South China
Normal University by the team directed by He Anping. It is included in the online
database of LINDSEI Partners, but, unfortunately, no further information is available.26
20 The Spoken Chinese Corpus of Informal Interaction is available online at:
https://github.com/blculyn.
21 The SLABank is a component of TalkBank dedicated to providing corpora for the study
of second language acquisition. It is available at: https://slabank.talkbank.org/.
22 TalkBank is a project organised by Brian MacWhinney at Carnegie Mellon University.
Its goal is to foster fundamental research in the study of human communication with an
emphasis on spoken communication. Currently, it provides repositories in 14 research
areas. Data in TalkBank have been contributed by hundreds of researchers working in
over 34 languages internationally who are committed to principles of open
data-sharing. Further information is searchable at: https://talkbank.org/.
23 Information about the COPA Corpus and the link to it are available at:
https://www.clarin.eu/resource-families/L2-corpora.
24 The HKPU Corpus and related information are available at:
https://slabank.talkbank.org/access/Mandarin/HKPU.html.
25 LINDSEI is the Louvain International Database of Spoken English Interlanguage. This
project was launched in 1995, five years after the start of the ICLE, by the research
team working at the Catholic University of Louvain. The aim of this project was to
provide a spoken counterpart to ICLE, containing oral data produced by advanced
learners of English from several mother tongue backgrounds. To date, eleven components
have been completed and made available online. LINDSEI and ICLE have been built
according to similar principles and share as many as ten mother tongue backgrounds.
This means that they can be used in combination with each other to compare spoken and
written interlanguage. Nowadays, LINDSEI also contains several subcorpora including a
wide variety of L2s. Further information on LINDSEI and LINDSEI partners is searchable
at: https://uclouvain.be/en/research-institutes/ilc/cecl/lindsei.html.
26 The Chinese Subcorpus of LINDSEI is searchable online at:
https://uclouvain.be/en/research-institutes/ilc/cecl/lindsei-partners.html.
In addition to the corpus projects mentioned so far, a few other L2
Chinese interlanguage corpora cited in the literature include those
developed at Ludong University (Hu, Xu 2010), collecting data from
Korean learners, and Nanjing Normal University (Xiao, Zhou 2014),
which also provide a taxonomy for error annotation.
The UCLA Heritage Language Learner Corpus (Ming, Tao 2008)27 is worth a separate
discussion. It collects data from Chinese heritage language (HL) learners with Chinese
family background. They represent “a specific group of CSL learners because they have
usually acquired some degree of the Chinese language at a young age and have the
advantage in listening to and speaking Chinese” (Zhang, Tao 2018, 53). The corpus was
developed by the team working at the University of California, Los Angeles, and
collects written data by Chinese HL learners at the intermediate level attending
elementary heritage Chinese classes in 2006 and 2007. As for the corpus size, it
contains approximately 1,000 samples of written essays and compositions students
completed as homework assignments, with 200,000 characters. The genres of the texts
are argumentative, narrative, and descriptive. The corpus is POS and error tagged,
following a coding system which was specifically developed for HL learner error
annotation. According to Zhang and Tao (2018), it is the first Chinese learner corpus
built in North America.
7.2 L2 Chinese Learners’ Input Corpora
Apart from the corpus-based Chinese interlanguage research, there is another specific
type of learner corpora: ‘L2 Chinese learners’ input corpora’ (Xu 2019). According to
Xu (2019, 43),
[t]he term ‘input corpus’ is used by some learner corpus linguists
meaning the collection of learners’ language exposures such as
teachers’ talk in class as well as the written texts that the learn-
ers are confronted with in learning.
The language input refers to written texts and Chinese language teaching textbooks
that learners are likely to read in real life. The corpus compilers gather these
specific input resources and then turn them into teaching and learning resources (Xu
2019). For example, the corpus of Chinese textbooks for international students
developed by the research team at Xiamen University (Su 2010) is freely available
online, and it has become a useful resource for researchers and CSL teachers.28 It
collects data from 11 different CSL textbooks published from 1996 to 2007 which have
been digitalised and converted into a corpus format. The corpus contains 771,350
Chinese characters (Xu 2019).
27 Information about the UCLA Heritage Language Learner Corpora is searchable at:
https://nhlrc.ucla.edu/nhlrc/home.
A similar project has been developed at Sun Yat-sen University in
Guangzhou. The team has compiled a CSL textbook corpus which includes
data from 1,752 textbooks published from 2006. Specifically,
1,802 out of 3,212 textbooks were published outside China (Zhou et
al. 2017). This corpus is not available for public use.
As reported by Xu (2019), the teams working on the compilation of
the above-mentioned corpora have conducted research on the coverage
of vocabulary and grammar points across different CSL textbooks
in order to provide useful learning and teaching resources.
8 Ongoing Research in CSL Learner Corpus Construction
The compilation of CSL learner corpora, as with learner corpora of many
other languages, remains a challenging research topic for the research
community. Nowadays, many research groups are engaged in the
development of learner corpora that can be adopted in Chinese lan-
guage teaching and learning in the future. For instance, current ef-
forts in L2 Chinese learner corpus work at BLCU are devoted to the
construction of a new large-scale corpus project: the International
Corpus of Learner Chinese. The projected corpus will collect approximately
50 million characters, including 45 million written interlanguage
Chinese and five million spoken interlanguage Chinese
characters (Cui, Zhang 2011; Zhang, Cui 2013). It will collect data
of learners from a wide variety of L1 backgrounds and learning con-
texts. Moreover, it will
comprise ve sub-corpora: raw corpus, annotated corpus, statis-
tical information corpus, metadata corpus, and Chinese native
speaker primary and middle school student corpus. (Zhang, Tao
2018, 54)
The corpus will be made available online to interested researchers
and educators (Zhang, Tao 2018).
The team led by Liang Yuan at The Education University of Hong
Kong is working on the construction of a CSL learner corpus for
character-writing error that will be made searchable as an online
resource for researchers and educators in the field of CSL. This project
will collect handwriting data by students who learn L2 Chinese
and will develop a tag-set for the error annotation.29
28 The corpus of Chinese textbooks for international students is freely available
at: https://ncl.xmu.edu.cn/?query404path=https%3A%2F%2Fncl.xmu.edu.cn%2Fshj%2FDefault.aspx.
The present review of L2 learner corpora and the above-mentioned
description of current work in progress in the field of CSL
LCR demonstrate that most learner corpus projects mainly collect
data from Asian or English-speaking learners and there is a shortage
of corpora which collect data from learners whose L1s are European
languages. Zhang and Tao (2018) reach a conclusion in line with the
results of the present survey. In fact, they state that since 2010 there
has been a significant surge of interest in CSL LCR, but
currently available corpora are unbalanced, with data mainly from
Asian learners (specically Korean and Japanese); there are far
less data from European-speaking regions. (Zhang, Tao 2018, 54)
In this respect, Istvanova (2021), for instance, points out that the lack
of L2 Chinese corpora collecting data from Slovakian learners hinders
scientific production in the field of Chinese acquisitional studies
and negatively affects the teaching of Chinese to Slovakian students.
In order to fill this gap, Istvanova (2021) compiled a small-scale corpus
of L2 Chinese with data from Slovakian learners with the purpose
of gaining a better understanding of the gradual development
of the learners’ interlanguage and of how errors vary as language
proficiency evolves. Istvanova also hopes that the creation
of the first Chinese learner corpus of Slovak students will contribute
to increasing the current limited availability of teaching materials
in the language combination Chinese-Slovak.
An analogous situation also arises in the context of the acquisi-
tion of Chinese by Italian-speaking learners. As stated by Romag-
noli and Conti (2021), in Italy, the general interest in Chinese lan-
guage has been echoed by a growing and conspicuous production of
teaching materials and tools of various kinds and levels, with an
ever-increasing attention to the different types of users and learning
contexts. However, they emphasise that although Chinese teaching
in terms of number of learners and teaching publications is in good
health, research on teaching and acquisition is less developed, partially
due to a general lack of scientific interest in this area until
recently. Along the same lines, Iurato (2021a) emphasises the need
for corpora which collect data from Italian learners of L2 Chinese
that could be interrogated to investigate the acquisition of Chinese
by Italian learners. However, it must be pointed out that in the last
few years in Italy there has been a remarkable increase in studies in
the field of L2 Chinese acquisition produced by a constantly growing
research community.30 For an overview of the studies on the acquisition
of L2 Chinese produced by the Italian research community,
see Morbiato (2021) and Romagnoli and Conti (2021).
29 Information about the construction of the CSL learner corpus for character-writing
error is searchable at: https://www.eduhk.hk/chl/knowledge-transfer.
Nonetheless, despite the undeniable recent proliferation of acquisitional
studies conducted in Italy, at the Graduate Student Conference
on Learner Corpus Research 2021, Iurato (2021a) pointed out
that in the Italian context there is a lack of research applying the rigorous
methodological framework of LCR in the compilation of learner
corpora to study the acquisition of L2 Chinese by Italian learners. She
also highlighted that Chinese is understudied in LCR, and it is for this
reason that more research should be developed in this field. In this regard,
Iurato (2021b) is currently working on the compilation of a written
and spoken L2 Chinese corpus to study the acquisition of the Chinese
shì…de cleft construction by L1 Italian learners. She adopts
a multi-method triangulated approach consisting of the combination
of corpus data and experimental data to provide different insights into
the phenomenon under study (Callies 2013; 2015b; Gilquin 2021). The
corpus collects written and spoken data of 103 L1 Italian university
learners studying at Ca’ Foscari University of Venice at elementary,
intermediate, and advanced proficiency levels through open-ended tasks.
The written corpus includes 53,248 Chinese characters and 2,337 occurrences
of the shì…de construction. The spoken data (24 hours of
speech) were manually transcribed and contain 19,073 Chinese characters
and 1,305 occurrences of the shì…de construction. Moreover, Iurato
collected data from 30 L1 Chinese speakers to include an equivalent
native-speaker control group. All data are complemented by a rich
set of metadata with learners’ and native speakers’ sociolinguistic variables.
She also developed a target-oriented error taxonomy to manually
annotate the grammatical errors; a pragmatic annotation was also
added to detect the inappropriate use of the pragmatic functions
(highlighting information and contrastive focus) of the shì…de cleft construction.
Following Granger (2012) and Díez-Bedmar (2015), the identification
of errors was carried out simultaneously by a bilingual team
composed of two Chinese native-speaking experts and the researcher,
whose L1 is the same as the learners’. Furthermore, to counterbalance
potential construct underrepresentation (Tracy-Ventura, Myles
2015), she collected experimental data through additional experimental
tasks. This research, to the best of our knowledge, is the first study
grounded in the LCR framework which explores the acquisition of a
specific syntactic feature by L1 Italian learners of Chinese.
30 See, for example, Conti 2021; Conti, Lepedat 2021; Eletti, Casentini, Fontanarosa
2021; Gabbianelli 2020; Morbiato 2020; Romagnoli 2018; 2021; Tucci 2021.
Thanks to the funding for the Research Project of National Inter-
est 2020 (PRIN 2020) allocated by MUR (Italian Ministry of Univer-
sity and Research), and co-funded by Università Ca’ Foscari Vene-
zia and Università degli Studi Roma Tre, the research group led by
Bianca Basciano will work on the compilation of a new learner cor-
pus to analyse the acquisition of Chinese resultative verbal complex-
es by L1 Italian learners. This project is also contextualised in the
LCR framework and it is based on the combination and analysis of
learner corpora and experimental data.
The two above-mentioned corpora will be useful to support the
SLA and LCR community for the development of pedagogical tools.
Moreover, they will be expanded and made available to other schol-
ars and educators for further research.
9 Concluding Remarks
LCR has continuously grown over the past few decades, and it is a
widely recognised branch of the broader field of corpus linguistics.
As Tracy-Ventura and Paquot (2021b, 32) highlight,
LCR has constantly questioned its role, methods, and goals, and
has, as a result, evolved remarkably over the last thirty years.
The application of corpora has enabled researchers to
explore areas as diverse as second language acquisition, psycho-
linguistics and natural language processing and to utilize their
research ndings within L2 pedagogy. (Gráf 2017, 22)
The eld has seen its worldwide expansion on two main fronts: rst,
the increasing assortment of L2s for which learner corpora are be-
ing produced; second, the growing number of learner corpora used
to analyse language learner use (Gráf 2017). As stated by Gráf (2017,
22), “[s]omewhat sadly – if perhaps not surprisingly – the target lan-
guage for most learner corpora is English”.
Nonetheless, this paper revealed how, since 2010, there has been
a surge of interest in CSL LCR, reflecting what Zhang and Tao (2018,
54) believe is a “shift of attention to corpora in the field of CSL”. In
the last decade, a plethora of studies have addressed methodological
issues to improve corpus construction, including collection of data,
compilation, annotation, user interface, and application in teaching
and learning. Moreover, the construction of L2 Chinese learner
corpora has become the empirical basis for many doctoral dissertations,
monographs, and research papers. Since the launch of the first
L2 Chinese learner corpus in 1995, CSL LCR and related research
have achieved a great deal. CSL LCR is a significant resource both
for CSL research and corpus linguistics research. In fact, with the
growing advancement in the construction of corpus tools, CSL LCR
is set to play an even more important role in understanding the
acquisition of L2 Chinese.
As highlighted by Stewart, Bernardini and Aston (2004), there are
different types of interaction between language corpora and learners.
Learners may be the authors or contributors of corpus data, they may
be the ultimate beneficiaries of a corpus’s insights, e.g. through the
intermediation of the teacher, or they themselves may interrogate a
corpus to better understand the syntactic, semantic, and pragmatic
properties of a linguistic feature. This potential of learner corpora
should be taken into account and applied in the context of Chinese
learning and teaching to improve learners’ explicit knowledge of the
Chinese language. With regard to the pedagogical implications, it is
also important to stress that LCR can be useful in two ways: the direct
use and the indirect use (Zhang, Tao 2018). The indirect use refers
to the use of learner corpora to guide the writing of textbooks
and dictionaries. CSL textbook writers and pedagogical materials
developers should take advantage of
[t]he rich understandings gained from LCR, including the acquisition
orders and developmental sequences of different linguistic
features, the typical errors learners tend to commit at different
levels, and the desirable and less desirable L1 effects. (Zhang,
Tao 2018, 56)
CSL LCR is also useful in dictionary compilation. For example, the
Cambridge Advanced Learner’s Dictionary (2003) used a learner corpus
to include information about learner errors. As claimed by Zhang
and Tao (2018), L2 Chinese learner corpora could also be used to compile
dictionaries which include information on frequent learner errors.
As for the direct use of L2 Chinese learner corpora, teachers
and students can use them for pedagogical purposes in L2 Chinese
classrooms. Moreover, CSL learner corpora can support the process
of CSL assessment. Similarly to other learner corpora, L2 Chinese
learner corpora
can serve as critical resources by providing quantitative, empir-
ical information that can guide the development of assessment
measures, such as placement tests, exit tests, and other types of
proficiency assessment. (Zhang, Tao 2018, 57)
The present review of existing L2 Chinese learner corpora suggests,
in addition, a lack of learner corpora exclusively collecting data from
Italian learners. Specically, there is a lack in the compilation of L2
Chinese learner corpora applying the LCR methodological framework.
The research carried out by Iurato (2021a; 2021b) and the PRIN
research project directed by Basciano are a first contribution which
aims at filling this gap; nonetheless, further studies need to be developed
in this direction in the future.
Finally, it is important to specify that, due to the unique charac-
teristics of Chinese, corpus tools developed for English or other Eu-
ropean languages cannot be easily applied to analyse Chinese; there-
fore, the compilation of learner corpora remains a challenging topic
for the CSL LCR community, especially for those interested in ana-
lysing the acquisition of L2 Chinese by learners whose L1s are not
English or Asian languages.
Bibliography
Basciano, B.; Gatti, F.; Morbiato, A. (2020). “Introduction”. Basciano, B.;
Gatti, F.; Morbiato, A. (eds), Corpus-Based Research on Chinese Language
and Linguistics. Venezia: Edizioni Ca’ Foscari, 7-15. Sinica venetiana 6.
https://edizionicafoscari.unive.it/it/edizioni4/libri/978-88-6969-407-3/introduction/.
Barlow, M. (2005). “Computer-Based Analysis of Learner Language”. Ellis, R.;
Barkhuizen, G. (eds), Analysing Learner Language. Oxford; New York: Oxford
University Press, 335-57.
Bell, P.; Collins, L.; Marsden, E. (2021). “Building an Oral and Written Learner
Corpus of a School Programme: Methodological Issues”. Le Bruyn, Paquot
2021a, 214-42. https://doi.org/10.1017/9781108674577.011.
Biber, D.; Conrad, S.; Reppen, R. (1998). Corpus Linguistics: Investigating Language
Structure and Use. Cambridge: Cambridge University Press.
https://doi.org/10.1017/CBO9780511804489.
Brezina, V.; McEnery, T. (2021). “Introduction to Corpus Linguistics”. Tracy-
Ventura, Paquot 2021a, 11-22.
Callies, M. (2013). “Triangulation”. Schierholz, S.J.; Wiegand, H.E. (Hrsgg),
Wörterbücher zur Sprach- und Kommunikationswissenschaft [WSK] Online.
Berlin: De Gruyter Mouton. https://www.wsk.fau.de/.
Callies, M. (2015a). “Learner Corpus Methodology”. Granger, Gilquin, Meunier
2015a, 35-55. https://doi.org/10.1017/CBO9781139649414.003.
Callies, M. (2015b). “Using Corpora in Language Testing and Assessment: Current
Practice and Future Challenges”. Castello, Ackerley, Coccetta 2015,
21-35.
Callies, M.; Götz, S. (2015). “Learner Corpora in Language Testing and Assessment.
Prospect and Challenges”. Callies, M.; Götz, S. (eds), Learner Corpora
in Language Testing and Assessment. Amsterdam; Philadelphia: John
Benjamins, 1-9.
Cambridge Advanced Learner’s Dictionary (2003). Cambridge: Cambridge University
Press.
Castello, E.; Ackerley, K.; Coccetta, F. (eds) (2015). Studies in Learner Corpus
Linguistics. Bern: Peter Lang. https://doi.org/10.3726/978-3-0351-
0736-4.
Chang Liping 张莉萍 (2013). “TOCFL Zuowen yuliaoku de jianzhi yu yingyong”
TOCFL 作文语料库的建置与应用 (Compilation and applications of the TOCFL
Composition Corpus). Cui Xiliang 崔希亮; Zhang Baolin 张宝林 (eds), Di’er
jie Hanyu zhongjieyu yuliaoku jianshe yu yingyong xueshu taolunhui lunwen
xuanji 第二届汉语中介语语料库建设与应用国际学术讨论会论文选集 (Selected
papers from the 2nd International Conference on the Construction
and Applications of Chinese Learner Corpora). Beijing: Beijing Language
and Culture University Press, 141-52.
Chu, C.; Chen, X. (1993). “Constructing a Chinese Interlanguage Corpus”. Shi-
jie Hanyu Jiaoxue, 7(3), 199-205.
Chu Chengzhi 储诚志 et al. (1995). “Hanyu zhongjieyu yuliaoku xitong yanzhi
baogao” 汉语中介语语料库系统研制报告 (Research Report of The Corpus
of Chinese Interlanguage [CCI 1.0]). Beijing: Beijing Language and Culture
University Press.
Conti, S. (2021). “Italian Learners’ Use of Chinese Sentence-Final Particles:
Marking Interrogatives in a Tandem-Learning Context”. Instructed Second
Language Acquisition, 5(2), 202-31. https://doi.org/10.1558/isla.18813.
Conti, S.; Lepedat, C. (2021). “Situation-based Utterances in italiano e in cinese:
un confronto tra parlanti nativi e apprendenti italofoni”. Romagnoli, Conti
2021, 39-69.
Corder, S. (1975). “Error Analysis, Interlanguage and Second Language Acquisition”.
Language Teaching & Linguistics: Abstracts, 8(4), 201-18.
https://doi.org/10.1017/s0261444800002822.
Cui Xiliang 崔希亮; Zhang Baolin 张宝林 (2011). “Quanqiu hanyu xuexizhe
yuliaoku jianshe fangan” 全球汉语学习者语料库建设方案 (A Proposal for
the Building of the International Learner Corpus of Chinese). Yuyan Wenzi
Yingyong, 19(2), 100-8.
Dahlmeier, D.; Ng, H.T.; Wu, S.M. (2013). “Building a Large Annotated Corpus
of Learner English: the NUS Corpus of Learner English”. Proceedings of the
8th Workshop on the Innovative Use of NLP for Building Educational Applications
(BEA’13), 22-31.
Díez-Bedmar, M.B. (2015). “Dealing with Errors in Learner Corpora to Describe,
Teach and Assess EFL Writing: Focus on Article Use”. Castello, Ackerley,
Coccetta 2015, 37-69.
Eletti, V.; Casentini, M.; Fontanarosa, L. (2021). “Lo sviluppo della sensibilità
sublessicale negli apprendenti italiani di cinese lingua straniera”. Roma-
gnoli, Conti 2021, 159-80.
Ellis, R. (1985). Understanding Second Language Acquisition. Oxford: Oxford
University Press.
Ellis, R. (1987). Second Language Acquisition in Context. New York: Prentice Hall.
Ellis, R. (1994). The Study of Second Language Acquisition. Oxford: Oxford University
Press.
Feng, Z. (2006). “Evolution and Present Situation of Corpus Research in China”.
International Journal of Corpus Linguistics, 11(2), 173-207.
https://doi.org/10.1075/ijcl.11.2.03fen.
Gabbianelli, G. (2020). “Video-Based Instruction and Students’ Perception of
Cultural Understanding and Motivation in the Chinese Foreign Language
Classroom”. International Journal of Chinese Language Education, 8, 37-70.
Gillard, P.; Gadsby, A. (1998). “Using a Learners’ Corpus in Compiling ELT
Dictionaries”. Granger, S.; Leech, G. (eds), Learner English on Computer.
London: Addison Wesley Longman, 159-71.
https://doi.org/10.4324/9781315841342-12.
Gilquin, G. (2015). “From Design to Collection of Learner Corpora”. Granger,
Gilquin, Meunier 2015a, 9-34.
Gilquin, G. (2021). “Combining Learner Corpora and Experimental Methods”.
Tracy-Ventura, Paquot 2021a, 133-44.
Gilquin, G.; Gries, S.T. (2009). “Corpora and Experimental Methods: A State-of-the-Art
Review”. Corpus Linguistics and Linguistic Theory, 5(1), 1-26.
https://doi.org/10.1515/CLLT.2009.001.
Gráf, T. (2017). “The Story of the Learner Corpus LINDSEI_CZ”. Studie z Apliko-
vane Lingvistiky, 8(2), 22-35.
Granger, S. (2002). “A Bird’s-Eye View of Learner Corpus Research”. Granger, S.;
Hung, J.; Petch-Tyson, S. (eds), Computer Learner Corpora, Second Language
Acquisition and Foreign Language Teaching. Amsterdam; Philadelphia: John
Benjamins, 3-33. https://doi.org/10.1075/lllt.6.04gra.
Granger, S. (2003). “The International Corpus of Learner English: A New Resource
for Foreign Language Learning and Teaching and Second Language
Acquisition Research”. TESOL Quarterly, 37(3), 538-46.
https://doi.org/10.2307/3588404.
Granger, S. (2008). “Learner Corpora”. Lüdeling, A.; Kytö, M. (eds), Corpus Lin-
guistics. An International Handbook, vol. 1. Berlin; New York: Walter de
Gruyter, 259-75. https://doi.org/10.1080/00393270903392342.
Granger, S. (2012). “How to Use Foreign and Second Language Learner Corpora”.
Mackey, A.; Gass, S.M. (eds), Research Methods in Second Language Acquisition.
A Practical Guide. Hoboken (NJ); Oxford: Wiley-Blackwell, 7-29.
https://doi.org/10.1002/9781444347340.ch2.
Granger, S. (2021a). “Commentary: Have Learner Corpus Research and Second
Language Acquisition Finally Met?”. Le Bruyn, Paquot 2021a, 258-73.
Granger, S. (2021b). “Once Upon a Time… A Tale of Learner Corpus Research”.
Paper presented at The Graduate Student Conference in Learner Corpus Research
2021 (Elverum, Inland Norway University of Applied Sciences, 12
October 2021).
Granger, S. et al. (2009). International Corpus of Learner English (Version 2).
Louvain-la-Neuve: Presses universitaires de Louvain.
Granger, S.; Gilquin, G.; Meunier, F. (eds) (2015a). The Cambridge Handbook of
Learner Corpus Research. Cambridge: Cambridge University Press.
https://doi.org/10.1017/CBO9781139649414.
Granger, S.; Gilquin, G.; Meunier, F. (2015b). “Introduction: Learner Corpus Research
– Past, Present and Future”. Granger, Gilquin, Meunier 2015a, 1-5.
https://doi.org/10.1017/CBO9781139649414.001.
Hu Xiaoqing 胡晓清; Xu Xiaoxing 许小星 (2010). “Mianxiang zhongwen dianhua
jiaoxue de hanguo liuxuesheng hanyu zhongjieyu yuliaoku de kaifa yu jianshe”
面向中文电话教学的韩国留学生汉语中介语语料库的开发与建设
(The Development of a Computer-Assisted Chinese Language Teaching Oriented
Korean Students’ Interlanguage Chinese Corpus). Shuzihua Duiwai
Hanyu Jiaoxue Shijian yu Fansi, 19, 403-10.
Istvanova, M. (2021). “Chinese Learner Corpora and Creation of Slovak Learner
Corpus of Chinese”. The Silk Road. Language and Culture, 48-55.
Iurato, A. (2021a). “Compiling a Corpus of Written and Spoken L2 Chinese: Combining
Pragmatic-and-Error-Annotation to Study the Chinese shì…de
Cleft Construction”. Paper presented at The Graduate Student Conference
in Learner Corpus Research 2021 (Elverum, Inland Norway University of Applied
Sciences, 12 October 2021).
Iurato, A. (2021b). “The Acquisition of the Chinese shì…de Construction
by L1 Italian Learners: A Preliminary Analysis Based on a Learner Corpus
and Experimental Data”. Paper presented at the 6th International Conference
on Chinese as a Second Language Research (CASLAR 6-2021) (Washington,
George Washington University, 31st July 2021).
Le Bruyn, B.; Paquot, M. (eds) (2021a). Learner Corpus Research Meets Second
Language Acquisition. Cambridge: Cambridge University Press.
https://doi.org/10.1017/9781108674577.
Le Bruyn, B.; Paquot, M. (2021b). “Learner Corpus Research and Second Language
Acquisition: An Attempt at Bridging the Gap”. Le Bruyn, Paquot 2021a, 1-9.
https://doi.org/10.1017/9781108674577.002.
Lee, L.; Tseng, Y.; Chang, L. (2018). “Building a TOCFL Learner Corpus for Chinese
Grammatical Error Diagnosis”. Proceedings of the Eleventh International
Conference on Language Resources and Evaluation (LREC 2018). European
Language Resource Association, 2298-304.
Leech, G. (2011). “Frequency, Corpora and Language Learning”. Meunier, F.;
Cock, S.; Gilquin, G. (eds), A Taste for Corpora: In Honour of Sylviane Grang-
er. Amsterdam: John Benjamins, 7-31.
Lu, X.; Chen, B. (2019). “Computational and Corpus Approaches to Chinese Language
Learning: An Introduction”. Lu, X.; Chen, B. (eds), Computational and
Corpus Approaches to Chinese Language Learning. Singapore: Springer, 3-11.
Lüdeling, A.; Hirschmann, H. (2015). “Error Annotation Systems”. Granger,
Gilquin, Meunier 2015a, 135-57.
https://doi.org/10.1017/CBO9781139649414.007.
McCar thy, M.; O’Keee, A . (eds) (2010a). The Routledge H andbook of Corpus Lin-
guistics. London: Routledge.
McCarthy, M.; O’Keee, A. (2010b). “Introduction”. McCar thy, O’Keee 2010a,
1-28.
McEnery, T. et al. (2019). “Corpus Linguistics, Learner Corpora, and SLA: Em-
ploying Technology to Analyze Language Use”. Annual Review of Applied
Linguistics, 39, 159-75.
McEnery, T.; Hardie, A. (2011). Corpus Linguistics: Method, Theory and Practice.
Cambridge: Cambridge University Press.
McEnery, T.; Wilson, A. (1996). Corpus Linguistics. Edinburgh: Edinburgh University
Press.
McEnery, T.; Xiao, R.; Tono, Y. (2006). Corpus-Based Language Studies: An Ad-
vanced Resource Book. London: Routledge.
McEnery, T.; Xiao, R. (2016). “Corpus-Based Study of Chinese”. Chan, S. (ed.),
The Routledge Encyclopedia of the Chinese Language. London: Routledge,
438-51.
Meunier, F. (2021). “Introduction to Learner Corpus Research”. Tracy-Ventura,
Paquot 2021a, 23-36.
Ming, T.; Tao, H. (2008). “Developing a Chinese Heritage Language Corpus: Issues
and a Preliminary Report”. He, A.W.; Xiao, Y. (eds), Chinese as a Heritage
Language: Fostering Rooted World Citizenry. Honolulu: National Foreign
Language Resource Center, University of Hawai‘i, 167-78.
Morbiato, A. (2020). “Acquisition of Double-Nominative Constructions by Italian
L1 Learners of Chinese. A Cross-Sectional Corpus Study”. Annali di
Ca’ Foscari. Serie Orientale, 56, 377-408.
http://doi.org/10.30687/AnnOr/2385-3042/2020/56/015.
Morbiato, A. (2021). “Una Panoramica degli Studi sull’Acquisizione di Aspetti
Sintattici e Strutture Grammaticali del Cinese da Parte di Italofoni”. La
Lingua Cinese in Italia. Studi su Didattica e Acquisizione. Roma: Roma TrE-Press,
87-114.
Myles, F. (2015). “Second Language Acquisition Theory and Learner Corpus
Research”. Granger, Gilquin, Meunier 2015a, 309-31.
https://doi.org/10.1017/CBO9781139649414.014.
Myles, F. (2021). “Commentary: An SLA Perspective on Learner Corpus Research”.
Le Bruyn, Paquot 2021a, 258-73.
Nesselhauf, N. (2004). “Learner Corpora and their Potential in Language Teaching”.
Sinclair, J. (ed.), How to Use Corpora in Language Teaching. Amsterdam:
John Benjamins, 125-52. https://doi.org/10.1075/scl.12.11nes.
Nicholls, D. (2003). “The Cambridge Learner Corpus – Error Coding and Anal-
ysis for Lexicography and ELT”. Proceedings of the Corpus Linguistics 2003
Conference (CL’03), 572-81.
Richards, J. (1980). “Second Language Acquisition: Error Analysis”. Annual Review
of Applied Linguistics, 1, 91-107.
Romagnoli, C. (2018). “The Acquisition of Mandarin Sentence Final Particles
by Italian Learners”. International Review of Applied Linguistics in Language
Teaching, 58(4), 475-94.
Romagnoli, C. (2021). “Dire quasi la stessa cosa: l’apprendimento dei sinonimi
in cinese come lingua straniera”. Romagnoli, Conti 2021, 71-86.
Romagnoli, C.; Conti, S. (a cura di) (2021). La lingua cinese in Italia. Studi su
didattica e acquisizione. Roma: Roma TrE-Press.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford: Oxford Univer-
sity Press.
Stewart, D.; Bernardini, S.; Aston, G. (2004). “Introduction”. Aston, G.;
Bernardini, S.; Stewart, D. (eds), Corpora and Language Learners. Amsterdam: John
Benjamins, 1-18. https://doi.org/10.1075/scl.17.01ste.
Stubbs, M.; Halbe, D. (2012). “Corpus Linguistics: Overview”. Chapelle, C.A.
(ed.), The Encyclopedia of Applied Linguistics. Wiley Online Library.
https://onlinelibrary.wiley.com/doi/book/10.1002/9781405198431.
Su Xinchun 苏新春 et al. (2010). “Jiaocai yuyan tongji yanjiu de duoweidu gongneng”
教材语言统计研究的多维度功能 (The Multi-Dimensional Function of
Statistical Research on Textbook Language). Proceedings of the Innovation
of International Chinese Teaching Theories and Models Conference. Xiamen:
Xiamen Chubanshe, 128-41.
Teng, S. et al. (2007). “Huayuwen xuexizhe hanzi pianwu shuju ziliaoku jianli
ji pianwu leixing fenxi” 华语文汉字偏误数据资料库建立及偏误类型分析
(The Construction of Chinese Learners’ Character Writing Error Database and
the Analysis of Error Types). Proceedings of 2007 National Linguistics Conference,
313-25. Tainan: National Cheng Kung University.
Tognini Bonelli, E. (2010). “Theoretical Overview of the Evolution of Corpus Linguistics”.
McCarthy, O’Keeffe 2010a, 14-28.
Tono, Y. (2003). “Learner Corpora: Design, Development and Applications”.
Archer, D. et al. (eds), Proceedings of the Corpus Linguistics 2003 Conference.
UCREL Technical Paper, 16, 800-9.
Tracy-Ventura, N.; Myles, F. (2015). “The Importance of Task Variability in the
Design of Learner Corpora for SLA Research”. International Journal of Learner
Corpus Research, 1(1), 58-95.
Tracy-Ventura, N.; Paquot, M. (eds) (2021a). The Routledge Handbook of Second
Language Acquisition and Corpora. London: Routledge.
https://doi.org/10.4324/9781351137904.
Tracy-Ventura, N.; Paquot, M. (2021b). “Second Language Acquisition and Corpora.
An Overview”. Tracy-Ventura, Paquot 2021a, 1-8.
Tsang, W.; Yeung, Y. (2012). “The Development of the Mandarin Interlanguage
Corpus (MIC) – A Preliminary Report on a Small-Scale Learner Database”.
JALT Journal, 34(2), 187-208. https://doi.org/10.37546/JALTJJ34.2-1.
Tucci, T. (2021). “Il Principle of Temporal Sequence nell’apprendimento del sintagma
locativo ‘zài + luogo’: un’indagine preliminare su discenti italofoni”.
Romagnoli, Conti 2021, 115-39.
Wallace Robinett, B.; Schachter, J. (eds) (1983). Second Language Learning: Contrastive
Analysis, Error Analysis, and Related Aspects. Ann Arbor (MI): University
of Michigan Press.
Wang, M.; Malmasi, S.; Huang, M. (2015). “The Jinan Chinese Learner Corpus”.
Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational
Applications. Denver (CO): Association for Computational Linguistics,
118-23. https://doi.org/10.3115/v1/W15-0614.
Wang, Y. et al. (2021). “YACLC: A Chinese Learner Corpus with Multidimensional
Annotation”. Computer Science – Computation and Language, 1-5.
https://doi.org/10.48550/arXiv.2112.15043.
Wu, C.; Shih, C. (2014). “A Design of the Spontaneous Chinese Learner Speech
Corpus”. Learner Corpus Studies in Asia and the World, 2, 115-24.
https://doi.org/10.24546/81006694.
Xiao Xiqiang 肖奚强; Zhou Wenhua 周文华 (2014). “Hanyu zhongjieyu yuliaoku
biaozhu de quanmianxing ji leibie wenti” 汉语中介语语料库标注的全面性
及类别问题 (The Exhaustiveness and Taxonomy of Chinese Interlanguage
Corpus Annotation). Shijie Hanyu Jiaoxue, 28(3), 368-77.
Xu, J. (2015). “Corpus-Based Chinese Studies: A Historical Review from the
1920s to the Present”. Chinese Language and Discourse, 6(2), 218-44.
Xu, J. (2019). “The Corpus Approach to the Teaching and Learning of Chinese as
an L1 and an L2 in Retrospect”. Lu, X.; Chen, B. (eds), Computational and
Corpus Approaches to Chinese Language Learning. Singapore: Springer, 33-53.
Zhan, W. et al. (2006). “Recent Developments in Chinese Corpus Research”.
Proceedings of the 13th NIJL International Symposium, Language Corpora:
Their Compilation and Application. Tokyo: University of Tokyo Press, 315-36.
Zhang Baolin 张宝林 (2003). “HSK dongtai zuowen yuliaoku jianjie” HSK 动态
作文语料库简介 (Introducing Chinese Proficiency Test Dynamic Essay
Corpus). Ceshi Yanjiu, 1(4), 37-8.
Zhang Baolin 张宝林 (2010). “Guanyu tongyongxing Hanyu zhongjieyu
yuliaoku biaozhu moshi de zai renshi” 关于通用型汉语中介语语料库标注模式
的再认识 (Re-Considering the Models of Annotation of All-Purpose Chinese
Interlanguage Corpus). Shijie Hanyu Jiaoxue, 27(1), 128-40.
Zhang Baolin 张宝林 (2013). “Huibi yu fanhua – Jiyu HSK Dongtai Zuowen
Yuliaoku de ‘ba’ zi ju xide kaocha” 回避与泛化 – 基于 HSK 动态作文语料库的
“把”字句习得考察 (Avoidance and Overgeneralization – An Investigation of
Acquisition of the Ba-Sentence Based on the HSK Dynamic Composition
Corpus). Shijie Hanyu Jiaoxue, 24(2), 263-78.
Zhang Baolin 张宝林; Cui Xiliang 崔希亮 (2013). “‘Quanqiu hanyu zhongjieyu
yuliaoku jianshe he yanjiu’ de sheji linian” “全球汉语中介语语料库建设和研究”
的设计理念 (Design Concepts of “The Construction and Research of the
Interlanguage Corpus of Chinese from Global Learners”). Yuyan Jiaoxue Yu
Yanjiu, 24(5), 27-34.
Zhang, J. (2014). “A Learner Corpus Study of L2 Lexical Development of
Chinese Resultative Verb Compounds”. Journal of the Chinese Language
Teachers Association, 49(3), 1-24.
Zhang, J.; Tao, H. (2018). “Corpus-Based Research in Chinese as a Second
Language”. Ke, C. (ed.), The Routledge Handbook of Chinese Second Language
Acquisition. London; New York: Routledge, 48-62.
Zhang Ruipeng 张瑞朋 (2017). “Hanyu zhongjieyu yuliaoku zhong de hanzi
pianwu chuli yanjiu” 汉语中介语语料库中的汉字偏误处理研究 (The
Character Errors in Chinese Interlanguage Corpora). Yuliaoku Yuyanxue, 3(2), 50-9.
Zhang, Y. (2009). A Tutor for Learning Chinese Sounds through Pinyin [PhD
Dissertation]. Pittsburgh (PA): Carnegie Mellon University.
Zhou Xiaobing 周小兵 et al. (2017). “Guoji hanyu jiaocai yuliaoku de jianshe yu
yingyong” 国际汉语教材语料库的建设与应用 (The Construction and
Application of International Chinese Textbook Corpus). Yuyan Wenzi Yingyong,
25(1), 125-35.