How radical is pro-drop in
A quantitative corpus study on referential
choice in Mandarin Chinese
in the Faculty of Humanities and
Author: Maria Carina Vollmer
Examiner: Prof. Dr. Geoffrey Haig
Second examiner: PD Dr. Sonja Zeman
List of Abbreviations
List of Figures
2.1 Pro-drop
2.2 Radical pro-drop
2.2.1 Free distribution of zero arguments
2.2.2 Frequency of zero arguments
2.3 Referential choice
2.3.1 Factors influencing referential choice
2.3.1.1 Syntactic Function
2.3.1.2 Animacy
2.3.1.3 Topicality
2.3.1.4 Person
2.3.1.5 Antecedent-related factors
2.3.2 Referential choice in Mandarin
2.4 Interim conclusion
3.1 Research questions and hypotheses
3.2 The corpus
3.2.1 Multi-CAST (Haig & Schnell 2019)
3.2.2 Languages
3.2.3 Mandarin
3.2.3.1 Jigongzhuan (jgz)
3.2.3.2 Liangzhu (lz)
3.2.3.3 Mulan (ml)
3.2.3.4 Corpus annotation
3.3 Quantitative Analysis
4.1 Frequency of zero arguments
4.1.1 Distribution of noun phrase, pronoun and zero
4.1.2 Distribution in different syntactic functions
4.1.3 Frequency of only pronoun and zero
4.1.4 Interim discussion and conclusion
4.2 Probabilistic constraints
4.2.1 Mandarin
4.2.2 All languages
4.2.3 Interim conclusion
List of Abbreviations
1sg First person singular
2sg Second person singular
3sg Third person singular
1pl First person plural
2pl Second person plural
3pl Third person plural
asp Aspectual marker
mp Modal particle
neg Marker of negation
np Noun phrase
List of Figures
1 Multi-CAST Languages.
3 Occurrence of zero in the different languages (%).
4 Occurrence of pronouns in the different languages (%).
5 Occurrence of noun phrases in the different languages (%).
10 Percentages of zeros in all functions except for subject.
11 Percentages of zeros in object function.
12 Percentages of zeros in comparison with pronouns.
13 Decision tree for variation between noun phrases and zero
arguments in Mandarin. Pronouns are included in the data
but unused by the algorithm.
14 Decision tree for variation between pronouns and zero argu-
15 Decision tree for referential choice in all languages.
16 Decision tree between pronoun and zero for all languages.
Mandarin Chinese is famously characterised as a radical pro-drop
language (Roberts & Holmberg 2009: 9, Neeleman & Szendröi 2007, Liu
2014). The term pro-drop refers to the phenomenon whereby languages
allow referential forms in a clause to be dropped instead of being
realised overtly. The hearer thus has to infer from context which
referent the argument refers to. While other languages also exhibit this
so-called pro-drop, Mandarin supposedly allows zero arguments in a much
wider range of contexts and makes extensive use of them. This phenomenon
is often referred to as radical pro-drop (Roberts & Holmberg 2009: 8ff,
Ackema & Neeleman 2007: 83, Ackema et al. 2006: 5, Barbosa 2011a,
Neeleman & Szendröi 2007,
Studies of pro-drop are important in many respects. First of all,
pro-drop is used as a characteristic in typological classiﬁcations of lan-
guages in several regards, i.e. in the context of topic-prominent lan-
guages (Li & Thompson 1976) and non-configurationality (Hale
1983: 5). Ackema et al. (2006: 16) comment that
[...] one may even wonder whether pro-drop languages
do in fact have a structural subject position, or whether in
such languages apparent subjects are really optional additions
to the clause in a dislocated position. If so, this would be
reminiscent of the behaviour of all syntactic noun phrases in
non-conﬁgurational languages. The question is, then, to what
extent pro-drop languages are non-conﬁgurational.
Understanding pro-drop and referential choice also matters for
identifying the referent of an omitted or pronominal argument, both in
research on the processing costs of anaphoric expressions (e.g.
Gelormini-Lezama 2018) and in computer science, where machine
translation systems must correctly identify the referent of a pronoun
or dropped argument (see e.g. Zhang et al. 2019, Wang et al. 2017,
Soares 2016 for research on machine translation between pro-drop and
non-pro-drop languages).
Even though pro-drop is relevant in many respects, there has been
no large-scale quantitative study on the frequency of zero arguments in
Mandarin in comparison to a larger group of typologically and geograph-
ically diverse languages. This kind of study has only become possible in
recent years, since larger corpora of languages with consistent annota-
tions (of zero arguments) have only become available now. This thesis
responds to this research gap and the new possibilities with regard to cor-
pus studies and statistical software, and aims at quantitatively analysing
the frequency of zero arguments in Mandarin Chinese in natural spoken
language in comparison to other languages, using Multi-CAST, the
Multilingual Corpus of Annotated Spoken Texts (Haig & Schnell 2019).
It also aims at shedding light on how speakers choose between noun
phrases, pronouns and zero arguments, i.e. on which factors inﬂuence
their referential choice, and it compares these results to other languages.
In the next section, I will give an overview of the theoretical back-
ground, i.e. how (radical) pro-drop has been discussed in the literature,
how the terms pro-drop and radical pro-drop came into being and
I will provide deﬁnitions of the most important notions. A classiﬁcation
of diﬀerent types of pro-drop will be given. I will also sum up what
might inﬂuence referential choice according to diﬀerent studies. In Sec-
tion 3, I will introduce my research questions and hypotheses, then turn
to the data I used and collected, give an overview of Multi-CAST (Haig
I included Mandarin in the corpus and which methodological choices I
had to make during that process. I will also explain which quantitative
methods I use to analyse the data available to me. The results are then
described in Section 4. Namely, I ﬁrst count the frequency and rate of
zero arguments in Mandarin and then compare the results to the other
languages. I then use decision trees to predict referential choice in Man-
darin according to certain variables and compare these trees to the other
languages in the corpus. The results are discussed in Section 5 and put
into perspective with respect to current literature. I give a conclusion
and examine problems and questions that my study has uncovered and
on which further research could focus (Section 6).
In this section, I will give a short overview of the theoretical background
of pro-drop, radical pro-drop and referential choice. I will ﬁrst sum up the
research history on pro-drop, its deﬁnition and its diﬀerent classiﬁcations.
Since a review of all the literature available on this topic, especially
in the domain of Generative Grammar, would go far beyond the scope
of this thesis, only the most important observations and claims will be
summarised. A special focus will be given to radical pro-drop, which
is claimed to be a feature of Mandarin Chinese. Thus I will dive more
deeply into radical pro-drop and give a short overview of its deﬁnition,
of the claims made about radical pro-drop in the literature, and explain
how Mandarin ﬁts into the picture.
I will then concentrate on a diﬀerent but related question, namely
referential choice. I will give an overview of its research history and
variables claimed to inﬂuence it in diﬀerent languages. In the end, I
will turn to Mandarin and give an overview of studies and claims on
referential choice and its variables speciﬁcally in Mandarin.
2.1 Pro-drop
pro-drop refers to the covert realization of a core argument and the
distinction between languages that (routinely) allow the omission of ar-
guments, and languages in which the omission of an argument is usually
For instance, subjects¹ are regularly omitted in Spanish (1), while the
same construction would be ungrammatical in English² (2) (Roberts &
Holmberg 2009: 4, Huang 1984: 532):
(1) Habla español. (Roberts & Holmberg 2009: 4)
(2) * Speaks English. (Roberts & Holmberg 2009: 4)
The ﬁrst to make this distinction in Generative Grammar was Perl-
mutter (1971) albeit only for subjects (Roberts & Holmberg 2009: 3f.).
He proposed that languages can be classified into ‘Type A’ and ‘Type
B’ languages (Perlmutter 1971: 115), depending on whether they permit
zero subjects, and that this property correlates with other properties
of a language, e.g. that-trace effects and WH-movement,
The term ‘pro-drop’ to describe this phenomenon was ﬁrst coined in
Chomsky’s (1981) Government and Binding Theory (Ackema et al. 2006:
2, Barbosa 2011b). Subsequently, it became a highly debated topic in
Generative Studies, more precisely within the framework of Principles
and Parameters (e.g. Wratil 2011, Sessarego & Gutiérrez-Rexach 2017,
Barbosa 2009, Speas 2006, Koeneman 2006, Bennis 2006, Adams 1987,
Barbosa 2011a; see Battistella 1985, Huang 1984, Huang 1992 and Liu
2014 for their discussions of Chinese). This framework assumes that
there are certain universal principles of Universal Grammar (UG) which
represent the innate grammatical knowledge of a child and make it
possible for it to acquire language as fast as it does (Wratil 2011: 47).
These principles are claimed to vary depending on certain parameters,
the pro-drop or null subject parameter being one of these (Chomsky 1981:
231-284, Wratil 2011: 47, Koeneman 2006: 76; see contrary views in Bennis

¹ Note that in Spanish, only the subject may be omitted, while the object is
obligatorily overt (Huang 1984: 532, Liu 2014: 4).
² Note that it is possible or obligatory to omit the subject in certain English
constructions (Huang 1984: 532).

Table 1: Rich verbal inflection in Italian versus low verbal inflection in
English, adapted from Ackema & Neeleman (2007: 82)
Person  Italian    English
1pl     parl-iamo  speak
2pl     parl-ate   speak
3pl     parl-ano   speak

The stereotypically analysed pro-drop languages Italian and Spanish
show rich verbal inﬂection whereas the stereotypically analysed non-pro-
drop languages English and French do not (see Table 1 for a comparison
of Italian and English). Thus the (dropped) subject is coreferenced on
the verb in Italian and Spanish, but not in English and French. This has
led Generative Grammar to claim that rich verbal inﬂection plays a role
in pro-drop (Ackema & Neeleman 2007: 82, Koeneman 2006: 76f., Bennis
2006: 101, Fuß 2011: 53, Huang 1984: 534, Huang 1992: 9, Neeleman &
Szendröi 2007: 671, Liu 2014: 3, Travis & Cacoullos 2012: 733). This
hypothesis was supported by diachronic studies, e.g. on the loss of
pro-drop in Old French coinciding with the loss of inflectional endings
on the verb (Ackema & Neeleman 2007: 82, Wratil 2011: 103ff, see e.g.
Fuß 2011 for a more critical diachronic study).
However, subsequent research revealed that languages can allow the
dropping of both subject and object even though the verb shows no
or little agreement with either (e.g. Ackema & Neeleman 2007 for Early
Modern Dutch). The most prominent example of this so-called radical
or discourse pro-drop (Ackema & Neeleman 2007: 83, Ackema
et al. 2006: 5, Barbosa 2011a, Liu 2014) is Mandarin, but it is not the
only language displaying this feature (Ackema & Neeleman 2007: 83),
which poses a signiﬁcant problem for the hypothesis that pro-drop and
rich agreement systems go hand in hand (Huang 1984: 537, Huang 1992:
9, Neeleman & Szendröi 2007: 672).³

³ See also Fuß (2011) for a critical view from a diachronic perspective.

Various other characteristics are claimed to be connected to the
pro-drop parameter, namely free subject inversion, WH-movement and
so-called that-trace effects (e.g. White 1985, Bennis 2006: 102). Free
subject inversion refers to the phenomenon where subjects can occur
postverbally as well as preverbally, e.g. in Italian, as in (3) (Roberts
& Holmberg 2009:
(3) Hanno telefonato molti studenti. (Roberts & Holmberg 2009: 16)
The that-trace effect refers to the fact that sentences like (4) are
ungrammatical in English, whereas they are not in null subject languages:
(4) * Who did you say that ø wrote this book? (Roberts & Holmberg
However, the correlations between pro-drop and these two phenomena
have been questioned (Bennis 2006: 102). With time, studies on pro-drop
showed that there are more nuanced differences between languages
regarding pro-drop, e.g. what kinds of arguments can be dropped, the
amount of verbal inflection co-referencing the argument, and in which
constructions arguments can be dropped. For instance, a distinction
must be drawn with regard to which core arguments may be dropped
in a language: There are referential core arguments, and non-referential
(5) It rains.
(6) He eats.
In (5), the pronoun is non-referential as it does not refer to any spe-
ciﬁc entity, whereas the pronoun in (6) refers to a speciﬁc human en-
tity, namely the person that is eating. Some languages allow both non-
referential and referential arguments to be dropped (e.g. Mandarin),
while others only allow non-referential ones to be dropped (e.g. Finnish)
(Ackema et al. 2006: 12).
This has led to a distinction between different types of pro-drop
languages, which are listed below, sorted according to their scale of
freedom in allowing pro-drop⁴:
1. Non-null subject languages, e.g. English,
2. Expletive or semi-null subject languages (Roberts & Holmberg 2009:
8). This corresponds to the distinction between referential and
non-referential arguments made above. In some non-pro-drop languages
that generally do not allow omission of the subject, it is possible to omit
3. Partial null subject languages (Roberts & Holmberg 2009: 10ff.,
Rosenkvist 2010, Koeneman 2006: 77), namely languages in which
pronouns may be omitted, but only under certain conditions, e.g. only
the first and second person.
4. Consistent null subject languages (Roberts & Holmberg 2009: 6),
which are historically the first languages claimed to be pro-drop
languages and the languages which have received the most attention.
The subject can be omitted in all tenses and persons, and verbal
inflection is usually rich. Consistent null subject languages typically
also exhibit the above-discussed free subject inversion and that-trace
effects (Roberts & Holmberg 2009: 16, White 1985, Bennis 2006: 102).
5. Discourse or radical pro-drop languages (Roberts & Holmberg 2009:
8ff, Ackema & Neeleman 2007: 83, Ackema et al. 2006: 5, Barbosa
2011a, Neeleman & Szendröi 2007, Liu 2014) are languages that
freely allow pro-drop (namely in all constructions and in all
syntactic functions) but do not exhibit rich verbal inflection.

⁴ This hierarchy is adapted from Roberts & Holmberg (2009: 12).

In recent years, there have been new developments corresponding to
the availability of large amounts of corpus data, e.g. Multi-CAST (Haig
This makes it possible to conduct studies on the rate of overt and covert
anaphora in diﬀerent languages (e.g. Bickel 2003, Stoll & Bickel 2009),
and even to conduct probabilistic analyses of referential choice in natural
language use (e.g. Torres Cacoullos & Travis 2019, Schiborr 2018, Schnell
& Barth 2018, Travis & Cacoullos 2012). Bresnan et al. (2005: 2) note
[t]heoretical linguists have traditionally relied on linguis-
tic intuitions such as grammaticality judgments for their data.
But the massive growth of computer-readable texts and record-
ings, the availability of cheaper, more powerful computers and
software, and the development of new probabilistic models for
language have now made the spontaneous use of language in
natural settings a rich and easily accessible alternative source
of data. (Bresnan et al. 2005: 2)
I will now turn to radical pro-drop and explain how this notion is
connected to Mandarin. I will give an overview of claims about Mandarin
in the literature, outline why it is believed to be extraordinary with regard
to pro-drop and note where research gaps are to be ﬁlled.
2.2 Radical pro-drop
Mandarin has played a prominent role in research on pro-drop languages
and is one of the typical examples of so-called discourse or radical pro-
drop languages (Roberts & Holmberg 2009: 9, Neeleman & Szendröi
2007, Liu 2014). This is due to two claims about the radicality of pro-
drop in Mandarin, which will be demonstrated in the next two sections
(2.2.1 and 2.2.2).
2.2.1 Free distribution of zero arguments
The distribution of zero arguments in Mandarin is very free even though
there is no verbal agreement with any core arguments: it not only in-
cludes subjects – which remain the main emphasis of research on pro-drop
to date – but also arguments in any other syntactic function, e.g. objects
(Battistella 1985: 324, Roberts & Holmberg 2009: 9, Huang 1984: 533,
Neeleman & Szendröi 2007: 672, Liu 2014), as illustrated in Examples
(7) and (8) below.
‘He saw him.’ (Roberts & Holmberg 2009: 9)
‘He saw him.’ (Roberts & Holmberg 2009: 9)
Interestingly, Huang (1984) notes that in radical pro-drop languages
there seem to be other mechanisms and constraints at work to identify
reference. This refers to what he calls “subject-object asymmetry”, and
describes the fact that objects in Mandarin dependent clauses are not
bound in reference to the matrix clause, but to the discourse context,
while Mandarin subjects and English subjects and objects are bound to
the matrix clause (Huang 1984: 541).
(9) Speaker A:
‘Who saw Zhangsan?’ (Huang 1984: 539)
(10) Speaker B:
‘Zhangsan said Lisi saw him.’ (Huang 1984: 539)
The claim Huang (1984: 539) makes here is that the reference of the
omitted object in (10) is not automatically Zhangsan if the sentence is
uttered out of context; rather, a hearer would prefer a reading where
it is not Zhangsan. Only the discourse context (Example 9) makes a
reading possible where the referent of the omitted object is Zhangsan.
Comparing this to English, we see that the reference of an object pronoun
is bound to the matrix clause:
(11) John said that Bill knew him. (Huang 1984: 538)
In Example (11), the pronoun is interpreted as referring to John if the
sentence is uttered without discourse context (Huang 1984: 539).
Note that in my view, discourse plays a role in English as well as
in Mandarin, since the pronoun in both languages could have a number
of diﬀerent referents depending on discourse context. In addition, one
should be cautious when relying on discourse-free utterances, since these
are highly unnatural and thus do not represent actual language use. This
is also one of the critical points that Yan Huang (1992) makes against
James Huang (1984): “there is, in my opinion, no such thing as a prag-
matically neutral linguistic example, since we understand the meaning
of a linguistic example only against a set of background assumptions”
(Huang 1992: 23).
Yet, he agrees that zero arguments in Mandarin are ultimately “not
grammatically but pragmatically determined” (Huang 1992: 27), compared
to languages like English, in which zero anaphors are grammatically
determined by sentence structure. This raises the question of which
pragmatic factors influence referential choice, a question taken up in
the next section.
Some researchers have claimed that the pragmatic factors inﬂuenc-
ing referential choice might diﬀer between pro-drop types. In their view,
there is a fundamental diﬀerence between pro-drop languages and radical
pro-drop languages with regard to referential choice. Because verbal
inflection is poor in radical pro-drop languages, mechanisms other than
co-reference with the verb might determine reference, and other influences
might play a role in the choice between noun phrase, pronoun and zero.
For instance, Li & Bayley (2018: 137) note that referential choice in
Mandarin may be governed by different factors than in other languages.
2.2.2 Frequency of zero arguments
Bickel (2003: 708) refers to the comparison of the rate of zero arguments
across languages as referential density. Based on renarrations of
the Pear Stories (Chafe 1980), Bickel (2003: 708) analysed and compared
Belhare, Nepali and Maithili, and found “a statistically significant
difference between referential density means in narratives across speakers
of different languages” (Bickel 2003: 732).
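The idea of comparing rates of zero arguments across languages can be made concrete with a minimal sketch. The Python snippet below is purely illustrative: the language labels, the toy annotations and the function names are hypothetical, the counts are invented, and the simple proportion computed here is an assumption that may differ in detail from the way Bickel (2003) operationalises referential density.

```python
from collections import Counter

# Hypothetical toy annotations: one (language, form) pair per argument position.
# The data are invented for illustration and are not taken from any corpus.
ARGUMENTS = [
    ("lang_a", "np"), ("lang_a", "zero"), ("lang_a", "zero"), ("lang_a", "pro"),
    ("lang_b", "np"), ("lang_b", "pro"), ("lang_b", "pro"), ("lang_b", "zero"),
]


def zero_rate(arguments):
    """Return, per language, the share of argument positions realised as zero."""
    totals = Counter(lang for lang, _ in arguments)
    zeros = Counter(lang for lang, form in arguments if form == "zero")
    return {lang: zeros[lang] / totals[lang] for lang in totals}


def overt_rate(arguments):
    """Return, per language, the share of overt arguments (np or pro)."""
    return {lang: 1.0 - rate for lang, rate in zero_rate(arguments).items()}


if __name__ == "__main__":
    for lang, rate in sorted(zero_rate(ARGUMENTS).items()):
        print(f"{lang}: {rate:.2f} zero, {1 - rate:.2f} overt")
```

Computing the rate per language rather than over raw counts keeps texts of different lengths comparable.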
With regard to Mandarin, there is a persistent claim that zero argu-
ments are very frequent (Li & Thompson 1979, Huang 2000, Yang et al.
2003: 287). For instance, Battistella (1985: 324) claims that they are a
“pervasive feature of Chinese”, and Bickel (2003: 708) notes that “Chinese
discourse [...] is well known for often being very implicit about referents
compared to other pro-drop languages”. Similarly, Pu (1997: 281) writes
[u]nlike English which uses anaphoric pronouns extensively
and zero anaphora in syntactically more constrained circum-
stances, Chinese makes a much lesser use of lexical pronouns
in tracking reference and a principal use of zero anaphora in
This has led some researchers to even claim that languages like Man-
darin are fundamentally diﬀerent from other languages, as pragmatics
plays a much larger role in grammar than in other languages. Huang
(2000: 261-277) claims that there are “pragmatic languages” and includes
Mandarin in this list.
This claim about Mandarin has not been tested quantitatively in a
large-scale comparative study. Only a few recent studies concentrate
on probabilistic analyses of referential choice (e.g. Li & Bayley 2018,
Pu 1997, Li 2012, Pu 1995), and the claim has mostly been based on
intuitive judgements about Mandarin made in Generative Grammar.
In the next section, I will give an overview of referential choice, factors
claimed to inﬂuence it and studies on referential choice in Mandarin.
2.3 Referential choice
As noted in the section above, a question related to radical pro-drop is
how referential choice is distributed in discourse, namely when a speaker
chooses to use a lexical noun phrase, a pronoun or a zero argument.
There are two diﬀerent approaches to pro-drop, namely the genera-
tive parametric approach discussed above, and the usage-based approach
taken by e.g. Bickel (2003), and Schnell & Barth (2018: 58).
The usage-based approach tries to generalise rules about pro-drop
based on corpus linguistics (Schnell & Barth 2018: 58), and it is the
approach I adopt in this thesis. It has the advantage that it analyses
actual spoken language within its discourse context, while the generative
parametric approach relies on intuition and individual utterances taken
out of context. The latter is often not suitable for explaining a speaker’s
subconscious probabilistic choice, especially when more than one variable
has an impact and when choices depend on the discourse context. The study
of referential choice is concerned with the speaker’s choice between the
three possible forms an argument can take:
When a speaker is using language, grammar cues him/her
to particular choices: which word order to use, where to place
a discourse marker and which one, etc. Thus grammar is actu-
ally a system that guides a speaker’s choices. Some choices are
relatively rule-based, whereas other choices are rather proba-
bilistic. Discourse-related choices mostly belong to the latter
kind: a certain option is not strictly required or strictly ruled
out, and more than one option is to a certain extent permis-
sible. (Kibrik 2011: 15)
This is also the case with referential choice; in Mandarin, none of
the three forms (NP, pro, zero) would render a sentence ungrammatical;
rather, diﬀerent forms are more or less appropriate depending on context.
While understanding the probabilistic rules behind this would be very
interesting, they are very hard to research, since one cannot rely on
introspection to determine them. The aim of this thesis is therefore to
profit from recent developments in corpus linguistics and statistical
advances in linguistics and to use probabilistic methods to predict
which form a speaker is most likely to use.
This is connected to claims that referential choice is influenced by
different factors in different languages, possibly corresponding to the
different classifications of pro-drop mentioned above. For instance,
Ackema et al. (2006: 16) write that “[...] it turns out that the
syntax of subjects in pro-drop languages deviates from that of subjects in
non-pro-drop languages in a number of respects.” Speciﬁcally, it has been
claimed that Mandarin, as a radical pro-drop language, might act diﬀer-
ently to other languages regarding referential choice. Some researchers
have suggested that this diﬀerence is due to the diﬀerent verbal marking
in these languages: Ackema & Neeleman (2007) claim that since Early
Modern Dutch has little verb agreement that could grammatically refer
to the dropped argument, pragmatic conditions play a larger role in de-
termining the referential choice of either pronoun or zero in Early Modern
Dutch. The dropped argument must then be more salient in discourse
than in Italian-style pro-drop languages, the analysis of which Ackema &
Neeleman (2007) base on Accessibility Theory (see Section 2.3.1).
Ackema et al. (2006: 15) claim that in Mandarin, a dropped argu-
ment is the topic of the discourse. They believe that a diﬀerence must
be drawn between languages with rich verbal inﬂection that allow pro-
drop when the argument remains identiﬁable through coreference on the
verb (‘pro-drop’), and languages that only allow pro-drop when the
argument is topic and can be identiﬁable through the discourse context
Many studies, however, point in the direction that all languages have
the same constraints, e.g. Pu (1995) for English and Mandarin, or Tor-
res Cacoullos & Travis (2019) for English and Spanish:
Rates of use are not a reliable comparison measure. De-
spite the conspicuous rarity of unexpressed subjects in En-
glish compared with Spanish, there is structured variabil-
ity within this non-null subject language, which, contrary to
cherished belief, displays striking parallels with variation pat-
terns in the null-subject language. (Torres Cacoullos & Travis
In the remainder of this section, I will give an overview of factors
claimed to have an impact on referential choice in the literature. This
will then be the basis for the variables I will look at in my analysis of
2.3.1 Factors inﬂuencing referential choice
There are several variables claimed to inﬂuence referential choice. Some
of these are dependent on the discourse context, e.g. antecedent dis-
tance, while others are independent of the discourse context, as these are
inherent properties of the referent, e.g. animacy.
Since too many factors have been proposed in the literature to include
in this description, I limit myself to the most important ones. Factors
that have been mentioned in the literature but will not be discussed here
include discourse segmentation (Giora & Lee 1996: 114), definiteness
(Ariel 1996: 22), constructions with speciﬁc semantic verb classes (Travis
& Cacoullos 2012: 725), and backgrounded tense-aspect-mood in Spanish
(Travis & Cacoullos 2012: 725).
2.3.1.1 Syntactic Function
According to Du Bois (1987) and Du Bois (2003), information is dis-
tributed in discourse in an ergative pattern, and this distribution is what
gives rise to ergative languages. Du Bois (2003: 34) posits four con-
straints on Preferred Argument Structure:
1. Avoid more than one lexical core argument.
2. Avoid lexical A. (see also Du Bois 1987: 823)
3. Avoid more than one new core argument. (see also Du Bois 1987:
4. Avoid new A. (see also Du Bois 1987: 827)
This would imply that the syntactic function of the referent plays a
role in referential choice, with transitive subjects tending to be
pronominal or zero rather than full lexical noun phrases.
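The four constraints listed above can be sketched as simple checks over an annotated clause. This is only an illustration under stated assumptions: the dictionary keys (`role`, `form`, `new`), the role labels and the function name are hypothetical, and the constraints are treated here as hard checks although Du Bois presents them as preferences in discourse rather than grammatical rules.

```python
# Each argument is a dict with hypothetical keys:
#   role: "A" (transitive subject), "S" (intransitive subject) or "P" (object)
#   form: "np" (lexical noun phrase), "pro" (pronoun) or "zero"
#   new:  True if the referent is newly introduced into the discourse

CORE_ROLES = {"A", "S", "P"}


def pas_violations(clause):
    """Return the numbers (1-4) of the constraints that a clause violates."""
    core = [arg for arg in clause if arg["role"] in CORE_ROLES]
    lexical = [arg for arg in core if arg["form"] == "np"]
    new = [arg for arg in core if arg["new"]]

    violations = []
    if len(lexical) > 1:                            # 1. more than one lexical core argument
        violations.append(1)
    if any(arg["role"] == "A" for arg in lexical):  # 2. lexical A
        violations.append(2)
    if len(new) > 1:                                # 3. more than one new core argument
        violations.append(3)
    if any(arg["role"] == "A" for arg in new):      # 4. new A
        violations.append(4)
    return violations


# A clause with a zero, given A and a lexical, new P respects all four constraints.
preferred = [
    {"role": "A", "form": "zero", "new": False},
    {"role": "P", "form": "np", "new": True},
]

# A clause with two lexical, new core arguments violates all four.
dispreferred = [
    {"role": "A", "form": "np", "new": True},
    {"role": "P", "form": "np", "new": True},
]
```

In a corpus study one would count how often each constraint is violated rather than reject individual clauses.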
Du Bois (1987: 829) believes that these eﬀects are eﬀects of topic con-
tinuity, since the topic is prototypically human, in the A role, and given.
This claim thus correlates with the factors of animacy and topicality,
Recently, Haig & Schnell (2016) showed that while the A role does
have a low rate of full noun phrases, this can be better explained with
the variable of animacy, thereby questioning the claims of Du Bois (1987)
and Du Bois (2003). This variable will be discussed in the next section.
2.3.1.2 Animacy
Roughly speaking, animacy denotes the distinction between human, an-
imate and inanimate entities (Dahl & Fraurud 1996). In this thesis,
animacy as a variable is only concerned with the binary distinction be-
tween human and non-human referents. It is an inherent property of the
referent and independent of the discourse context, and has been claimed
to play a role in referential choice, e.g. by Fraurud (1996), Ariel (1996:
22) and Hsiao et al. (2014).
Hsiao et al. (2014) showed that subject omission in Mandarin is more
frequent when both the subject and the object are animate, and less
frequent when the object is inanimate. Out of the 3810 transitive clauses
analysed in their study, 2445 had an overt subject, and 1365 contained
In contrast, Schnell & Barth (2018) found that animacy could not
correctly predict the choice between pronominal and zero objects. The
authors used texts from diﬀerent registers, which enabled them to com-
pare animate and inanimate discourse topics, and found that discourse
topicality, rather than animacy, plays a role in referential choice. The
connection between animacy and topicality was also drawn by Pu (1997):
Among various semantic, pragmatic and discourse factors,
animacy (+/-hum) seems to strongly aﬀect the syntactic
coding of a referent because in narrative discourse, a refer-
ent that is human is more often topical, agentive, given and
deﬁnite than a non-human referent, and is more likely to be
coded by grammatical subject and hence zero anaphora. (Pu
In line with the study of Schnell & Barth (2018), Pu (1997: 290) found
that animacy (+/-hum) increases the likelihood of pronouns in contrast
to zeros in both English and Mandarin, but noted that this effect was
especially high when the referent was topical, while it was low when the
referent was not topical.
2.3.1.3 Topicality
As was already shown in the section above, topicality is a widespread
notion in information structure, and has been claimed to play an impor-
tant role in referential choice (Ariel 1996: 22, Schnell & Barth 2018: 73,
Huang 1984: 541). However, its deﬁnitions vary strongly.
Most importantly, topicality of a referent can either refer to the dis-
course topic, namely the topic of a certain narrative and “that discourse
entity that an entire text is about and that makes the text interesting”
(Schnell & Barth 2018: 59), or to the sentence topic, which stays within
the boundaries of a sentence, and is often discussed in connection with
so-called topic-prominent languages, in which topics are claimed to be
an important feature of the grammar and which might even have topics
(and comments) rather than subjects (and predicates) as their most basic
clause structure (e.g. Li & Thompson 1981: 15f., 85ﬀ.).
These two diﬀerent kinds of topicality need to be kept distinct when
discussing their inﬂuence on referential choice:
This suggests that discourse topicality and sentence topi-
cality do not converge when it comes to the use of pronouns
for objects in Vera’a, and discourse topicality is a factor in
its own right, distinct from the pragmatic relation of topic
within a sentence. (Schnell & Barth 2018: 73)
Schnell & Barth (2018) found that discourse topicality was the best
predictor for the choice between zero and pronoun in Vera’a. Discourse
topics in Vera’a are more likely to be realised pronominally (Schnell &
Barth 2018: 59). However, since discourse topicality is not explicitly
annotated in Graid (Haig & Schnell 2014) and RefIND (Schiborr et al.
2018), it cannot be analysed in this thesis directly.
An important question thus is how to analyse and deﬁne topicality in
the corpus. One possibility would be to assume that all human referents
are topics, as was done in Schnell & Barth (2018: 59): “for narratives,
we assume that all human or human-like referents [...] are the discourse
topics, [....]. And for the two types of descriptive texts, it is the ﬁsh or
plant species, respectively.” Accordingly, animacy could be analysed as
being an indirect indicator of topicality (1). A second analysis of topi-
cality could propose that referents are more topical when they have been
mentioned more recently. This would correspond to antecedent distance
(2). Finally, one could count the most frequent referents in each text
and assign them topic status, which corresponds to overall frequency of
referents (3). These three variables are assumed to indirectly correlate
with the topicality of referents in this study.
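The three proxy variables can be illustrated with a minimal sketch. It assumes that every mention is stored with a clause index, a referent identifier and a +/-hum flag, and that antecedent distance is counted in clause units; the data, the field names and the function name are hypothetical and are not taken from the corpus.

```python
from collections import Counter

# Hypothetical toy data: one (clause_index, referent_id, is_human) tuple per mention.
mentions = [
    (1, "monk", True),
    (2, "temple", False),
    (3, "monk", True),
    (6, "monk", True),
    (7, "temple", False),
]


def topicality_proxies(mentions):
    """Return one dict per mention holding the three proxy variables."""
    # (3) overall frequency: number of mentions of each referent in the text
    totals = Counter(ref for _, ref, _ in mentions)
    last_seen = {}
    rows = []
    for clause, ref, is_human in mentions:
        # (2) antecedent distance in clause units (assumption); None for first mentions
        distance = clause - last_seen[ref] if ref in last_seen else None
        rows.append({
            "referent": ref,
            "human": is_human,  # (1) animacy (+/-hum)
            "antecedent_distance": distance,
            "overall_frequency": totals[ref],
        })
        last_seen[ref] = clause
    return rows


rows = topicality_proxies(mentions)
```

In this toy text, the third mention of the human referent has an antecedent distance of three clause units and an overall frequency of three.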
Another factor claimed to inﬂuence referential choice is person (Wratil
2011: 119). This is an inherent property of the referent and independent
of the discourse context.
In some so-called partial pro-drop languages (see Section 3), e.g.
Finnish and Hebrew, only the ﬁrst and second person pronouns can be
omitted (Koeneman 2006: 100). In Vera’a, Schnell & Barth (2018: 74)
found that while there was variation between pronominal and zero forms
in the third person, all ﬁrst and second person mentions were pronominal.
Wratil (2011: 119) believes that ﬁrst person referents are more “topic-
worthy” than second person referents, which are in turn more “topic-
worthy” than third person referents. The more “topic-worthy” a referent
is, the more likely it is to be realised as an unmarked argument, in the
syntactic function of a subject, and to have the semantic role of an agent
(Wratil 2011: 119). This is because ﬁrst and second person referents
refer to the actual speech act participants (Wratil 2011: 119). Similarly,
Ariel (1996: 22) believes that speaker and addressee are inherently more
Null and overt pronouns in Mandarin carry diﬀerent information,
since the pronouns in Mandarin carry information on person and gen-
der (see also Gelormini-Lezama 2018: 387). It would thus be possible
that person aﬀects referential choice between pronouns and zero. Note
that full lexical noun phrases are always third person, and should thus
be excluded from any analysis of person as an inﬂuencing factor.
Li & Bayley (2018: 149) found that person and number played a
role in their study of subject omissions in Mandarin. Also, Li (2012:
107) found that singular subjects are more likely to be pronominal, while
plural subjects in Mandarin tend to be covert.
2.3.1.5 Antecedent-related factors
Antecedent-related factors are all dependent on the discourse context
because they do not concentrate on the anaphoric form itself but rather
on its antecedent, namely the last mention of the referent in discourse.
There are several variables that could be tested in this area, but given
the limited scope of my thesis, I will explore only the one I consider
most important, namely antecedent distance, which is also
connected to topicality as discussed above.
It might prove fruitful to test other variables in the future, e.g. the
syntactic function of the antecedent (Gelormini-Lezama 2018: 387, Schnell
& Barth 2018, Travis & Cacoullos 2012), the referential form of the an-
tecedent (Schnell & Barth 2018, Travis & Cacoullos 2012, Torres Cacoul-
los & Travis 2019), and competition between potential referents (Ariel
1988: 65, Travis & Cacoullos 2012, Li 2012: 107).
The distance to the last mention of a referent is claimed to have an
impact on referential choice (e.g. Ariel 1996: 22, Ariel 1988: 65). This
factor is closely connected to Accessibility Theory.
Accessibility Theory assumes that referential choice is deter-
mined by what the speaker assumes to be the “degree of accessibility of
the mental entity for the addressee” (Ariel 1996: 20). The accessibility
of a referent correlates inversely with the markedness of the referring
form: the more accessible a referent is, the less marked the form.
Accessibility Theory assumes that speakers and hearers carry mental
representations of referents, which can be more or less accessible de-
pending on certain factors, and are marked diﬀerently by the speaker
depending on their accessibility status (Ariel 1988: 80).
The way mental representations are marked is deﬁned in the Acces-
sibility Marking Scale:
(12) The Accessibility Marking Scale, taken from Ariel (1996: 30):
zero < reﬂexives < agreement markers < cliticized pronouns <
unstressed pronouns < stressed pronouns < stressed pronouns +
gesture < proximal demonstrative (+NP) < distal demonstrative
(+NP) < proximal demonstrative (+NP) + modiﬁer < distal
demonstrative (+NP) + modiﬁer < ﬁrst name < last name <
short deﬁnite description < long deﬁnite description < full name
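The scale in (12) can be read as an ordered list. The following sketch encodes it in the order given above, running from the form that marks the highest accessibility (zero) to the form that marks the lowest (full name). The constant and function names are illustrative only.

```python
# Forms in the order of (12): earlier entries mark higher accessibility.
ACCESSIBILITY_MARKING_SCALE = [
    "zero",
    "reflexive",
    "agreement marker",
    "cliticized pronoun",
    "unstressed pronoun",
    "stressed pronoun",
    "stressed pronoun + gesture",
    "proximal demonstrative (+NP)",
    "distal demonstrative (+NP)",
    "proximal demonstrative (+NP) + modifier",
    "distal demonstrative (+NP) + modifier",
    "first name",
    "last name",
    "short definite description",
    "long definite description",
    "full name",
]


def marks_higher_accessibility(form_a, form_b):
    """True if form_a signals a more accessible referent than form_b."""
    scale = ACCESSIBILITY_MARKING_SCALE
    return scale.index(form_a) < scale.index(form_b)
```

For instance, `marks_higher_accessibility("zero", "unstressed pronoun")` is true, which mirrors the expectation that zero arguments are used for the most accessible referents.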
With regard to Mandarin, then, the zero argument would be the
marker of highest accessibility (Giora & Lee 1996: 113). Ariel (1996: 22) believes
that the more recently a referent has been mentioned in discourse, the
more accessible it is. This is one of the most central claims of Accessibility
Theory, and has been supported by Travis & Cacoullos (2012: 729), who
believe that their ﬁnding of the relevance of switch reference in referential
choice proves the usefulness of Accessibility Theory.
Contrary to these claims, Schnell & Barth (2018) found in their study
on referential choice in Vera’a that antecedent distance did not play any
role in the choice between pronoun and zero (but Accessibility Theory
might still be useful for lexical noun phrases).
Schnell & Barth’s (2018) quantitative probabilistic study on referen-
tial choice only between pronoun and zero in Vera’a thus does not lend
any support to Accessibility Theory; in fact, all variables that Acces-
sibility Theory would predict to have an eﬀect (antecedent distance,
discourse interruptions) did not prove to be relevant (Schnell & Barth
Finally, our ﬁndings provide ample counterevidence to the
universal relevance of accessibility and activation, suggesting
that at least the choice between pronoun and zero for objects
is not accountable for in terms of AT and similar frameworks
concerned with discourse structure. From an activation point
of view, these two forms of reference appear to be too similar
to mark signiﬁcant diﬀerences. (Schnell & Barth 2018: 76)
Consequently, Accessibility Theory and antecedent distance might play
an important role in the distinction between lexical noun phrases and
pronominal / zero arguments, while they seem to be irrelevant for the
distinction between pronominal arguments and zero arguments.
2.3.2 Referential choice in Mandarin
It has long been claimed, though on a purely intuitive basis, that Man-
darin behaves fundamentally diﬀerently from other languages, the fol-
lowing quotation being a response to these claims:
We recognize that speakers’ choices between overt and
null pronouns are likely to pattern diﬀerently in a radical
pro-drop language like Chinese, which lacks verbal inﬂections
to indicate person and number, than in an inﬂected language
like Spanish. (Li & Bayley 2018: 137)
For a long time, these claims had not been tested, but in recent
years, there have been more and more studies on referential choice and
probabilistic analyses outside the realm of GG, focusing on corpus stud-
ies. Some of these have centred on Mandarin as a radical pro-drop
language, e.g. Li & Bayley (2018), Pu (1997), Li (2012), Pu (1995).
Pu (1995) analyses anaphoric distribution in Mandarin and compares it
to English, concluding that they are subject to the same constraints in
anaphoric distribution in narratives (narrative production task, Pu 1995:
Pu (1997: 286) investigates pragmatic, semantic and discourse factors
in the distribution of zero anaphora, using the ﬁrst 25 pages of three con-
temporary Chinese novels. Li (2012) analysed speech from three diﬀerent
discourse contexts and used logistic regression (Li 2012: 102) to analyse
subject pronominal expression. They found that switch reference, per-
son, number, animacy, speciﬁcity and sentence type played a role, as well
as sociolinguistic factors of the speakers.
2.4 Interim conclusion
In this chapter, I have given an overview of pro-drop, radical pro-drop
and referential choice in general, and in Mandarin, speciﬁcally.
I have shown that languages can be grouped into diﬀerent classiﬁca-
tions with regard to the grammaticality and ungrammaticality of zero
arguments in diﬀerent constructions and positions. The most important
types for the purpose of this thesis are
1. non-pro-drop languages, e.g. English,
2. pro-drop languages, e.g. Spanish, and
3. radical pro-drop languages, e.g. Mandarin.
While zero arguments are mostly ungrammatical in a simple declara-
tive sentence in English, and can only occur in speciﬁc constructions that
grammatically force the omission of an argument or when the subjects
are co-referential, arguments can generally be omitted in Spanish and
Mandarin without resulting in ungrammaticality. Yet, in Spanish, these zero arguments are
limited to subjects, while all arguments can freely be omitted in Man-
darin. In addition, Spanish exhibits a rich verbal agreement system,
while Mandarin does not co-reference its core arguments on the verb.
Mandarin has played an important role in the literature (e.g. Bat-
tistella 1985, Huang 1984, Huang 1992, Liu 2014, Roberts & Holmberg
2009, Neeleman & Szendröi 2007) because a) it was claimed that it freely
admits the omission of all arguments (e.g. Battistella 1985: 324, Roberts
& Holmberg 2009: 9, Huang 1984: 533, Neeleman & Szendröi 2007:
672, Liu 2014), and b) that these zero arguments are very frequent, even
compared to other pro-drop languages (e.g. Li & Thompson 1979: 322,
Huang 2000: 262, Yang et al. 2003: 287, Battistella 1985: 324, Bickel
2003: 708, Pu 1997: 281).
While these claims persist to this day, there has been little quantita-
tive research to test them and compare Mandarin to other languages.
Referential choice is concerned with the question of how the speaker
chooses which form to use in discourse when all three are grammatically
possible: the full lexical noun phrase, the pronoun, or a zero argument.
There are persistent claims that Mandarin as a radical pro-drop language
relies more heavily on pragmatics than other languages and that speakers’
referential choices diﬀer from those in other languages. These claims have
rarely been tested, and more quantitative studies on this are needed (see
e.g. Li 2012: 116 and Travis & Cacoullos 2012: 743):
The study of referent realization can be advanced through
the pursuit of accountable quantitative studies, in diﬀerent
language varieties; taking into consideration diﬀerent genres,
persons and syntactic roles; employing replicable operational-
izations of notions to be tested; exploring further the workings
of accessibility and the strength and interactions of priming
eﬀects; and identifying ﬁxed constructions which may exhibit
distinct behavior. (Travis & Cacoullos 2012: 743)
In order to test these claims in the subsequent chapters, I gave an
overview of potential drivers of anaphoric distribution discussed in the
literature. Among these are syntactic function, topicality (= animacy,
antecedent distance, overall frequency of referents), and person. Some of
these factors correlate with each other. For instance, referents are often
at the same time subjects, human, topical, and have low antecedent
distance. The statistical analysis should thus be suited to this kind of
data, and methods that can account for correlating variables should be
used.
It was also shown that referential choice between noun phrases, pro-
nouns and zero arguments might be due to diﬀerent factors than the
distribution of pronoun and zero. This was shown in Schnell & Barth
(2018), where antecedent distance seemed to play a role in the choice
between noun phrases and non-lexical arguments, but not in the choice
between pronoun and zero; but also when looking at the factor of person,
which does not play a role for noun phrases, since they are always in the
third person.
Based on the above-discussed research gaps and claims about pro-
drop in Mandarin, I will now formulate the research questions and hy-
potheses, and then describe the corpus I will use to test the hypotheses.
As the collection and preparation of the Mandarin sub-corpus has been
part of this thesis, I will discuss it in greater detail, i.e. my methods of
collecting the data, and any methodological decisions I had to make while
annotating the data. I will then clarify the methods I use in the quantita-
tive analysis, how it ﬁts with the correlating variables, and some choices
I made regarding the exclusion and inclusion of certain data points.
In this chapter, I will present the research questions and hypotheses.
Then, I will give an overview of the data that I will be using for the quan-
titative study, namely the Multi-CAST corpus (Haig & Schnell 2019), in
which Mandarin has been included as part of this thesis.
After a summary of all the languages available in the corpus at the
time of analysis, I will describe the quantitative analysis and explain
the methods, software and packages that I use and the variables that I
consider for the probabilistic analysis of referential choice.
3.1 Research questions and hypotheses
From the reviewed literature in Chapter 2, two research questions arise
that I devote the analysis of this thesis to:
1. Is there a higher rate of zero arguments in Mandarin than in other
languages?
2. Which probabilistic constraints inﬂuence referential choice, and are
these constraints diﬀerent from constraints in other languages?
a) Language
b) Syntactic function (Section 2.3.1.1)
c) Topicality (Section 2.3.1.3)
i. Animacy (+/-hum) (Section 2.3.1.2)
ii. Antecedent distance (Section 2.3.1.5)
iii. Overall frequency of referents (Section 2.3.1.3)
d) Person (only for the distinction between pronoun and zero
argument, Section 2.3.1.4)
The ﬁrst research question responds to the claim that the rate of zero
arguments in Mandarin is higher than in other languages. The second
research question aims at investigating if referential choice in Mandarin
is determined by other variables than in other languages.
The constraints that I want to test in the second research question are
inspired by what has been shown to be relevant in the literature before,
and what I have discussed in the previous chapter. The constraint of lan-
guage addresses the question of whether all languages have the same constraints
or whether they diﬀer in their constraints. If Mandarin really is fundamentally
diﬀerent from other languages, language will be one of the statistically
Syntactic function responds to the claims made by Du Bois (1987,
2003) but since Haig & Schnell (2016) have shown that this is actually
connected to animacy, I expect that it will not turn out statistically
signiﬁcant. Rather, I hypothesise that animacy (+/-hum) will be the
I hypothesise that topicality plays a role in inﬂuencing referential
choice. Next to animacy, antecedent distance and overall frequency of
referents are expected to play into referential choice which indirectly
As regards the choice between pronoun and zero, I will also analyse
has to be excluded in the analysis of noun phrases.
In conclusion, I formulate the following hypotheses for the two research
questions:
1. Speakers of Mandarin do not use a higher rate of zero arguments
than speakers of other languages in the corpus.
2. Probabilistic constraints inﬂuence referential choice. These con-
straints are the same in every language in the corpus. Language
and syntactic function do not inﬂuence referential choice. Topi-
cality (= animacy (+/-hum), antecedent distance and overall fre-
quency of referents) and person inﬂuence referential choice.
In the following section, I will now give an overview of the data I
use to test the hypotheses. Afterwards, I will explain the statistical and
quantitative methods that I use for the analysis.
3.2 The corpus
The approach I adopted in this thesis is a corpus-based and usage-based
one that diﬀers from the approach used in Generative Grammar.
While in earlier linguistics and in Generative Grammar, grammatical
rules were often stated in terms of introspection and intuition, I agree
with e.g. Chambaz & Desagulier (2016: 3) that intuitions do not always
show the full picture, especially with linguistic variation, and that a
database of natural language usage can give more information on how
speakers actually speak. In linguistics, this kind of database is often a
corpus, which is
guistic productions by native speakers. From a statistical
viewpoint, a corpus is a sample drawn from the true, unknown
law of a given language. (Chambaz & Desagulier 2016: 1)
The database of natural language usage that I use in the thesis is the
Multilingual Corpus of Annotated Spoken Texts (Multi-CAST, Haig &
Schnell 2019). I will give an overview of this corpus in the next section.
3.2.1 Multi-CAST (Haig & Schnell 2019)
At the time of data analysis, Multi-CAST (Haig & Schnell 2019) con-
sisted of eight sub-corpora, but more languages have been added re-
cently.
The goal of Multi-CAST (Haig & Schnell 2019) was to develop a
system of syntactic annotations that is ﬂexible enough to be applied
to typologically diverse languages, and at the same time still consistent
enough to enable quantitative cross-linguistic analysis between languages
(Haig & Schnell 2018 : 1).
Zero arguments are usually not added in other corpus annotations,
which means that studies on zero arguments have to add these in them-
selves. This poses a problem for cross-linguistic comparisons, since anno-
tations of zero arguments have to be consistent across languages in order
to be comparable. A huge advantage of Multi-CAST (Haig & Schnell
2019) is thus that annotations include zero arguments, with clear and
strict guidelines on when to add them. It is therefore possible to not
only count the rate of zero arguments in the diﬀerent languages consis-
tently, but also to analyse referential choice without losing one of the
most crucial parts of it, namely covert arguments (Haig & Schnell 2018
5This might lead to inconsistencies with regard to tier names (e.g. RefLex has
been changed to ISNRef), and some languages might have additional information
newly available now. The version I chieﬂy base my analysis on is version 1905 which
was the most recent one at the time of analysis; however, I have included preliminary
results from later versions (1907 and 1908) where necessary. The corresponding table
as well as R-scripts and the Mandarin data can be found in the Appendix for maximum
transparency and reproducibility.
Multi-CAST (Haig & Schnell 2019) oﬀers several diﬀerent tiers of
analysis in ELAN,6 namely GRAID, RefIND and RefLEX (as well as
standard tiers on morphological glossing, free translation and transcrip-
tion).
GRAID (Haig & Schnell 2014) annotations provide us with informa-
tion on the syntactic function, morphological form and animacy features
of referential expressions (Haig & Schnell 2014: 2), which is consistent
over corpora and languages (Haig & Schnell 2014: 3).
Studies that have used Multi-CAST (Haig & Schnell 2019) in the
past include Haig & Schnell (2016), Haig et al. (2017), Schiborr (2018),
Kimoto (2018), Schnell & Barth (2018), Schnell et al. (2018), Schnell &
In the following section, I will give an overview of the languages con-
tained in Multi-CAST (Haig & Schnell 2019) that were used in the anal-
ysis. An overview of the languages and their geographical distribution
can be seen in Figure 1.
6ELAN Version 5.2 [Computer software] 2018, April 4,
http://tla.mpi.nl/tools/tla-tools/elan/ (Max Planck Institute for Psycholinguis-
tics, The Language Archive, Nijmegen, The Netherlands), see also Brugman &
Figure 1: Multi-CAST Languages.7
I have included the following languages in the analysis: Mandarin, North-
ern Kurdish, Sanzhi, Tondano, English, Cypriot Greek, Vera’a and Teop
(see Figure 1). Other languages that may be available in the corpus now
could not be used, since they were not available at the time of analysis. I
also did not use Persian (Adibifar 2016, 2019), since the corpus consists
of renarrations of the Pear stories (Chafe 1980, Schiborr 2016), and thus
diﬀers a little in this regard from the other languages. With respect to
the second research question,8 only languages that include RefIND could
be included in the analysis, namely Mandarin, Cypriot Greek, Sanzhi,
Teop and Vera’a. A list of speakers of all languages and their metadata
can be found in Schiborr (2016).
8Which probabilistic constraints inﬂuence referential choice, and are these con-
straints diﬀerent from constraints in other languages?
Northern Kurdish (Haig et al. 2019a) Northern Kurdish belongs
to the West Iranian languages (Haig 2018). The Kurmanji corpus consists
of traditional narratives (Schiborr 2016: 4).
Sanzhi (Forker & Schiborr 2019) Sanzhi is a Nakh-Daghestanian
language spoken in central Daghestan, Russia (Schiborr 2019a: 12). The
corpus consists of traditional and autobiographical narratives (Schiborr
Tondano (Brickell 2016) Tondano is an Austronesian language spo-
ken in Indonesia (Schiborr 2016: 5). The corpus is made up of autobi-
ographical and stimulus-based narratives, diﬀering in this regard from
most other languages in the corpus (Schiborr 2016: 5).
English (Schiborr 2015) English belongs to the Indo-European fam-
ily (Schiborr 2016: 4). The corpus consists of autobiographical narratives
(Schiborr 2016: 4).
Cypriot Greek (Hadjidas & Vollmer 2015) Cypriot Greek belongs
to the Indo-European language family (Schiborr 2016: 4). The corpus
contains traditional narratives (Schiborr 2016: 4).
Vera’a (Schnell 2015) Vera’a is an Austronesian language spoken on
Vanuatu (Schiborr 2016: 5). The corpus consists of traditional narratives
(Schiborr 2016: 5).
Teop (Mosel & Schnell 2015) Teop is an Austronesian language
spoken in Papua New Guinea (Schiborr 2016: 5). The corpus consists of
traditional narratives (Schiborr 2016: 5).
7Note that Tulil and Nafsan are not included in the analysis, since they were not
available. I am indebted to Nils Schiborr who made this map available to me.
Mandarin belongs to the Sino-Tibetan language family (Li & Thomp-
son 1981: 2). It is isolating (Li & Thompson 1981: 10) and is often
described as a topic-prominent language (Li & Thompson 1981: 15, Li
& Thompson 1976). The Mandarin data set consists of three monologic,
natural narratives from three diﬀerent native speakers of Mandarin. The
texts were translated, transcribed and then annotated with a morphemic
gloss, GRAID (Haig & Schnell 2014), RefIND (Schiborr et al. 2018) and
RefLEX, respectively. The Mandarin sub-corpus will be published in
Multi-CAST after the completion of this thesis.
All stories were told in Putonghua (Mandarin), the oﬃcial national
language of China (Li & Thompson 1981: 1). Note that Mandarin in
itself is in many ways an artiﬁcial construct that is taught to children at
school and may still be highly inﬂuenced by regional diﬀerences between
speakers (Li & Thompson 1981: 1). The traditional oral narratives were
recorded in Xi’an, China, by the author during an exchange semester in
2015 and 2016. Two of the speakers were originally from northeastern
China, and one speaker was from Xi’an. The speakers were my school-
mates or friends. They were asked to tell a story of their choice and were
recorded while telling it. Speakers were informed that their stories would
be used for research and published online. They were not informed of the
speciﬁc research questions of this study. It was agreed that they could
stay anonymous if they wanted. They were not paid for the recordings,
but mostly invited to eat together and spend time with me beforehand.
In addition, small presents from Germany were given to them where ap-
propriate. The three stories contain 1175 clause units altogether. More
stories have been recorded and transcribed and will hopefully be added
to the corpus in the future.
Figure 2: Home of one of the speakers.9
3.2.3.1 Jigongzhuan (jgz)
This speaker is from the northeast of China, is a university student and
was 23 years old at the time of recording. The narrative tells anecdotes
from the life of an eccentric Buddhist monk. This story is quite famous in
China. Since no better place was available, the story was told
inside a university building, which means that there is occasional background
noise; it does not, however, disturb the story.
The narrative was told in a group of friends and schoolmates, who
were all native speakers of Chinese, and every one of them told a story.
Since it would have been impolite and distancing to leave the room in-
stead of listening to the story, I was present as well.10 I nevertheless stayed in
the background and encouraged the speaker of the story to tell it to his
friends rather than to me. The story is 21 minutes and 15 seconds long.
It consists of 720 clause units, eight of which are unclassiﬁable and were
thus excluded from analysis.
9Photo: Maria Vollmer, with the permission of the owner of the house.
10As a non-native speaker of Chinese, I tried to leave the room whenever possible,
in order to avoid possible inﬂuences on the way the story would be told when a listener
is not a native speaker of Chinese.
3.2.3.2 Liangzhu (lz)
The speaker of this story comes from Shaanxi Province, is a university
student and was 22 years old at the time of recording. He tells the
romantic love story of a couple that cannot be together because of societal
pressure and expectations. The story was recorded in my apartment,
since it was comparatively quiet there and we could be sure not to be disturbed
during the recording. The speaker told the story to his friend. I left the
room during the recording, so that only native speakers of Chinese were
present. The recording is eight minutes and 13 seconds long. The story
contains 189 clause units, seven of which were unclassiﬁable and will not
be included in the analysis.
3.2.3.3 Mulan (ml)
The speaker of this story is from the northeast of China, is a university
student and was 23 years old at the time of recording. He tells the story of
Mulan, a woman who, dressed as a man, secretly went to war in place of
her father, and became a war hero. The story was recorded in the same
setting as Liangzhu, namely in my apartment. I left the room during
the recording so that only native speakers were present. The story is ten
minutes and 25 seconds long. It consists of 306 clause units, ﬁve of which
are unclassiﬁable and have thus been excluded from analysis.
3.2.3.4 Corpus annotation
The data were analysed using the software ELAN,11 and annotated ac-
cording to Multi-CAST standards, namely the GRAID annotation guide-
lines (Haig & Schnell 2014) and RefIND annotation guidelines (Schiborr
et al. 2018).
The stories were transcribed by Liu Ruoyu as part of her work at the
Department of General Linguistics (University of Bamberg), to whom I
owe many thanks. She also helped me with questions on the translation
or linguistic structure of sentences.12
The three stories were annotated using all layers or tiers of Multi-
CAST annotations, namely a transcription, a free translation, a morpho-
logical gloss, GRAID, RefIND and RefLex. There were some language-
speciﬁc choices to be made in the annotations by the author. The most
important ones were about serial verb constructions, topic constructions,
ﬂexible word classes and the so-called diﬀerential object marking, dis-
cussed in the paragraphs below in detail. All of these problems were dis-
cussed with the Multi-CAST Team (Geoﬀrey Haig, Stefan Schnell and
Nils Schiborr) before being implemented.
Diﬀerential Object Marking In Mandarin, the canonical word order
SVO (Iemmolo & Arcodia 2014: 316) can be changed when the object is
moved in front of the predicate and marked with a preposition, e.g. ba
and gei (Li & Thompson 1981: 463, Liu 2007).
11ELAN Version 5.2 [Computer software] 2018, April 4,
http://tla.mpi.nl/tools/tla-tools/elan/ (Max Planck Institute for Psycholinguis-
tics, The Language Archive, Nijmegen, The Netherlands), see also Brugman &
12Of course, many other native speakers helped me whenever I had questions.
Unfortunately, it is impossible to mention them all here, but the most important ones
are Wu Shuang, Song Jian, Wang Lei and Zhang Jujia.
“He already scared the midwife.” (mandarin_jgc_105)
In GRAID, we agreed to gloss preverbal ‘objects’ that are marked
with an adposition as ‘obl’ instead of ‘p’ since, from a strictly formal per-
spective, these are marked with an adposition and thus not canonically-
marked objects. We do not think that this is diﬀerential object marking
in a narrow sense, but since it is mostly called DOM in the literature
(e.g. Iemmolo & Arcodia 2014), this is what we call it here for pragmatic
reasons.
Serial verb constructions This construction is problematic in a num-
ber of other languages in the corpus as well, e.g. in Northern Kurdish
(Haig et al. 2019b). In Mandarin, serial verb constructions are formally
very similar to (and often indistinguishable from) topic chains in which
multiple predicates occur as a string of verbs and their co-referential ar-
gument(s) are covert. While there are various language-speciﬁc means of
diﬀerentiating serial verb constructions from multiple predicates, often
involving the scope of negation or TAM markers over the whole pred-
icate instead of one single verb, in practice, most occurrences of serial
verb constructions in the corpus are formally ambiguous and thus indis-
tinguishable from topic chains with zero arguments:
“The couple went to Guo Qing temple to pray to Buddha.”
“to walk in the marriage sedan for the procession [to escort the
bride to the bridegroom’s home for the wedding].” (mandarin
In (15), zou and jin could be interpreted as being one single predicate
denoting ‘to walk in’, but could also be interpreted as two predicates in
somewhere. Simply on formal grounds, the second interpretation would
be more correct, since there is no formal marking that tells us that the
two verbs should be analysed as serial verbs.
In these cases, the constructions are thus glossed as multiple pred-
icates with covert arguments, and ‘svc_’ is added to the zero gloss. This
enables GRAID to capture as much information as possible, while still
giving researchers the possibility to exclude these zeros and thereby anal-
yse these constructions as serial verb constructions.
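How the ‘svc_’ prefix keeps both analyses available can be sketched as follows. The gloss strings are simpliﬁed, hypothetical stand-ins (actual GRAID glosses carry more information), and the function name is illustrative; only the labels ‘0’ and ‘svc_0’ are modelled.

```python
# Hypothetical, simplified gloss strings for a short stretch of text.
glosses = ["0", "svc_0", "pro", "np", "svc_0", "0"]


def count_zeros(glosses, include_svc=True):
    """Count zero arguments, optionally leaving out zeros flagged with 'svc_'."""
    plain = sum(1 for g in glosses if g == "0")
    flagged = sum(1 for g in glosses if g == "svc_0")
    # Including the flagged zeros treats the verb strings as topic chains;
    # excluding them treats the same strings as serial verb constructions.
    return plain + flagged if include_svc else plain


zeros_topic_chain_reading = count_zeros(glosses)
zeros_svc_reading = count_zeros(glosses, include_svc=False)
```

A single switch thus moves between the two analyses without re-annotating the data.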
In cases in which
1. the string of verbs clearly denotes a single event or action, or
2. analysing the verbs as multiple predicates in a topic chain would
change their meaning in a way that would be contextually incorrect,
the construction is analysed as a serial verb construction. In these
cases, the main verb is glossed ‘v:pred’ and the other verb is glossed
‘svc_lv’ or ‘svc_rv’.
“He/they brought him over.” (mandarin_jgz_224)
In example (16), we know from context that neither of the participants
comes over, as the subject of the clause is already in the right place, and
the object of the clause is a new-born baby that could thus not be the
subject of guolai. Thus this construction is a serial verb construction and
guolai changes its semantics to a simple directional meaning.
Across the three stories, there are 71 instances of svc_lv/rv and 60
instances of svc_0. For comparison, there are 589 zero arguments overall
in the Mandarin sub-corpus. Taken together, these 131 cases thus correspond to 22% of that figure.
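The 22% figure can be retraced from the counts just given. The following sketch assumes that the percentage relates both kinds of svc gloss together to the total number of zero arguments, since this is the reading that reproduces the number. The variable names are chosen here for illustration only.

```python
# Counts as reported in the text; variable names are illustrative.
svc_verb_glosses = 71    # instances of svc_lv / svc_rv
svc_zero_glosses = 60    # instances of svc_0
total_zeros = 589        # all zero arguments in the Mandarin sub-corpus

combined = svc_verb_glosses + svc_zero_glosses
share_combined = combined / total_zeros            # assumed reading of "22%"
share_zero_only = svc_zero_glosses / total_zeros   # svc_0 alone, for comparison

print(combined)                       # 131
print(round(share_combined * 100))    # 22
print(round(share_zero_only * 100))   # 10
```

Under this reading, the ambiguous zeros alone (svc_0) would account for roughly a tenth of all zero arguments.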
Flexible word classes Mandarin has relatively flexible word
classes. For instance, Sun (2006: 206) notes that “[n]early all Chinese
prepositions can be used as full-fledged verbs.” With regard to the corpus,
this poses a problem for prepositions that also act as verbs and are
often used in serial verb constructions. In these cases, the question is
if they are to be annotated as verbs or as prepositions; and, if they are
analysed as verbs, if they are serial verb constructions or two clauses.
This also affects the annotation of the argument after the verb, since it
would be oblique (‘obl’) if the word is analysed as a preposition, but object
(‘p’) if it is analysed as a verb. An example of this can be seen in (17):
“He has traveled to us.” (mandarin_jgz_226)
Here, dao could also be analysed as a verb, and the clause could then
be analysed as two clauses, which would also increase the rate of zero
arguments. Under the prepositional analysis, zheer is annotated as a goal,
while it would be analysed as an object if dao were a full verb. However,
I have chosen to analyse these instances as prepositions, since this is
the primary use of the word, and since this is the analysis in which I
presuppose the least and am closest to the actual formal representation.
For comparison, in other cases, dao is used alone and as a full verb,
as in (18):
“(They) came to the pawnshop.” (mandarin_jgz_0427)
Topic constructions Mandarin is claimed to be a topic-prominent
language, in which topics are an important feature of the
grammar and which might even have topics (and comments) rather than
subjects (and predicates) as its most basic clause structure (e.g. Li &
Thompson 1981: 15f., 85ff.). Topics may be separated from the rest of the
clause by pause particles (Li & Thompson 1981: 86), like ne in Example
(19). Note that here the topic is repeated as a subject in pronominal
form:
‘And Lianshanbo, he himself thought’ (mandarin_lz_040)
When a subject is separated from the rest of the clause by a pause
particle and is not repeated, unlike in Example
(19), zero is added in the gloss:
‘This Daoji, (he) usually read the scriptures in the temple yard.’
Here, daoji is separated from the rest of the clause with the pause
particle ba 13, and is thus analysed as topic. Since the referent is not
repeated overtly as a subject, unlike in Example (19), zero is added in the
gloss. Zero is not added when the subject is a lexical noun phrase without a
pause marker (Example 22), even though the subject may still be repeated
in the pronominal form (Example 21). The reason for this is that there
is no formal marking on the subject which lets us know that it is the
topic, except for its leftmost position in the clause.
‘But Zhu landlord, he had a daughter.’ (mandarin_lz_010)