2014. Deshors, Sandra C. & Stefan Th. Gries. A case for the multifactorial assessment of learner language: The uses of may and can in French-English interlanguage. In D. Glynn & J. Robinson (eds.), Corpus methods for semantics: quantitative studies in polysemy and synonymy, 179-204. Amsterdam: John Benjamins.


In this study, we apply Gries and Divjak's Behavioral Profile approach to compare native English can and may, learner English can and may, and French pouvoir. We annotated over 3,700 examples across three corpora according to more than 20 morphosyntactic and semantic features and we analysed the features' distribution with a hierarchical cluster analysis and a logistic regression. The cluster analysis shows that French English learners build up fairly coherent categories that group the English modals together followed by pouvoir, but that they also consider pouvoir to be semantically more similar to can than to may. The regression strongly supports learners' coherent categories; however, a variety of interactions shows where learners' modal use still deviates from that of native speakers.
A case for the multifactorial assessment of learner language
The uses of may and can in French-English interlanguage
Sandra C. Deshors and Stefan Th. Gries
New Mexico State University / University of California, Santa Barbara
Keywords: Behavioral Profiles, hierarchical cluster analysis, logistic regression,
modal verbs
1. Introduction and overview
Acquiring a foreign language is one of the most cognitively challenging tasks, given
how languages differ at every level of linguistic analysis. From a cognitively and
psycholinguistically oriented perspective, learning a language requires identifying a
very large amount of co-occurrence data – tense t and number n require subject-verb
agreement with morpheme m, idiom i consists of word w and word x, communicative
function f is communicated with intonation curve c, etc. – as well as storing and
retrieving them. Crucially, these types of co-occurrences are typically only probabilistic
rather than absolute/deterministic and, thus, hard to discern and learn: usually,
learners need to cope with many-to-many mappings between forms and functions,
and oen it is only the conuence of dierently predictive information on several
levels of linguistic analysis that narrows down the search for a particular meaning (in
comprehension) or a particular form (in production). In the Competition Model by
Bates and MacWhinney (1982, 1989), for example, this situation is modeled on the
assumption that forms and functions are cues to functions and forms, respectively,
and many dierent cues of dierent strengths, validities, and reliabilities must be in-
tegrated to, say in production, arrive at natural-sounding choices.
Semantics is a particularly tricky linguistic domain in this regard, in native
language, but even more so in foreign language learning. Not only do languages
often carve up semantic space very differently (so that the categories of the language
acquired first will influence category formation in the following), but semantic
differences are also often much less explicitly noticeable (than, say, the presence or absence
of a plural morpheme), which makes the identification of probabilistic co-occurrence
patterns all the more difficult. In order to allow for a precise description of semantic,
or more generally functional, characteristics of synonyms, antonyms, and senses of
polysemous words, Gries and Divjak developed the so-called Behavioral Profile (BP)
approach (cf. Gries and Divjak 2009). This approach, to be discussed in more detail
below, is highly compatible with a psycholinguistic perspective of the type outlined
above and involves a very fine-grained annotation of corpus data as well as their
statistical analysis.
The method of behavioral profiles has been successfully employed in a variety
of contexts – synonyms, antonyms, and word senses of polysemous words have been
studied both within one L1 and across two different L1s – as well as having received first
experimental support, but so far there have been no studies that test the BP approach’s
applicability to L1 and L2 data, which is what we will undertake here. The semantic
domain we will explore is one that has proven particularly elusive, namely, modality.
While many semantic phenomena can be clearly delineated and, to some degree,
explained by the linguistic analyst, modality has been much more problematic; in fact,
even the scope of the notion of modality has not really been agreed upon yet. In this
chapter, we specifically focus on the semantic domain of possibility as reflected in:
– the choices of can vs. may in essays written by native speakers of English;
– the choices of can vs. may in essays written by French learners of English;1
– the use of pouvoir in essays written by native speakers of French.
In Section 2, we discuss in what sense these modals pose a particular challenge to the
analyst as well as present previous corpus-based work on can and may and highlight
1. Following Bartning (2009), the term “advanced learner” is henceforth assumed to refer
to “a person whose second language is close to that of a native speaker, but whose non-native
usage is perceivable in normal oral or written interaction” (Hyltenstam et al. 2005: 7, cited in
Bartning 2009: 12).
some of the shortcomings of such work. In Section 3, we discuss the BP approach
in general as well as our own data and methods in particular. Section 4 presents the
results of our exploration, and Section 5 concludes the chapter.
2. Setting the stage
2.1 What is problematic about the modals?
As near synonyms in the domain of modality, may and can have fueled much
theoretical debate with regard to their semantic relations. As a pair, the two forms have
overlapping semantics which simultaneously cover the meanings of possibility,
permission and ability (cf. Collins 2009). This means that both forms can be used to
express epistemic, deontic and dynamic types of possibility. It follows that the semantic
investigation of may and can raises two problematic questions: first, to what extent
the various senses of each form can be distinguished, and second, to what degree the two
forms are semantically equivalent.
With regard to the rst question, studies such as Leech (1969) and Coates (1983)
have illustrated the diculty in distinguishing between the senses of may and can.
Leech (1969: 76), for instance, notes that “[t]he permission and possibility meanings
of may are close enough for the distinction to be blurred in some cases”. Similarly,
Coates (1983: 14) identies a “continuum of meaning” – i.e. gradience – in which
possible modal uses shade into each other. In the case of the meanings of can, for
instance, Coates notes that while permission and ability correspond to the core of two
largely intersecting fuzzy semantic sets, possibility, on the other hand, is found “in the
overlapping peripheral area” (p. 86).
With regard to the issue of the semantic equivalence of may and can, the literature
reveals similarly debated standpoints. While some studies recognize the similarities
of the two forms, others do not. In the former case, for instance, Collins (2009: 91)
states that “[t]he two modals of possibility may and can, share a high level of
semantic overlap” (despite their differing frequency of occurrence and different degrees of
formality), and Leech (1969: 75) notes that “[i]n asking and giving permission, can
and may are almost interchangeable”. Conversely, studies such as Coates (1983) have
clearly distinguished the two forms. For instance, while Coates (1983) does recognize
that the English modals share certain meanings and can be organized into semantic
clusters, she generally denies the synonymy of may and can by classifying the two
forms into two distinct semantic groups. Although she accepts that the two forms
may have overlapping meanings in some cases, she claims that even then, the two
forms do not occur in free variation.
The occurrence of one form over the other has been shown to be influenced, to
some extent, by its linguistic context. It has indeed been illustrated that particular
co-occurring grammatical categories interfere with the interpretation of the modals.
Leech (2004: 77), for instance, notes that certain uses of may are only to be found in
particular grammatical contexts: “only the permission sense, for instance, is found
in questions (…) and the negation of the possibility sense is different in kind from
the negation of the permission sense”. Generally, several grammatical categories have
been recognized as interacting with the uses of may and can. While negation is one
category that has commonly been identified (cf. Hermerén 1978; Palmer 1979; Coates
1980, 1983; De Haan 1997; Huddleston 2002; Radden 2007; Byloo 2009), voice and
sentence types have also been shown to have similar influences on the forms.
Overall, the above-mentioned studies all provide clear illustrations of the
complexity of the semantic relations between may and can on the basis of empirically
gathered evidence. However, they all tend to be based on generalized observations of
idiosyncratic behavioral tendencies. In that respect, they all raise the issue of how to
provide a more systematic account of the modals’ semantic characteristics and how to
integrate qualitative findings into a quantitative and empirically grounded approach.
2.2 Previous corpus-based work on the modals
2.2.1 Native English
As already mentioned above, Hermerén (1978) has shown that the semantics of the
modals in native English are morphosyntactically motivated to a considerable degree,
such that linguistic categories such as voice, grammatical person, type of main
verb (action, state, etc.), aspect and sentence type influence the interpretation of the
modals: “if these categories can be shown to modify the meaning of the modal […] it
is important that this should be accounted for in the description of the semantics of
the modals” (p. 74). While this claim calls for empirical validation, one implication of
Hermerén’s (1978) argument is that the quantitative study of modal forms will require
a powerful and versatile methodological approach. In a very similar fashion, Klinge
and Müller (2005: 1) argue that, to capture the essence of modal meaning, “it seems
necessary to cut across the boundaries of morphology, syntax, semantics and
pragmatics and all dimensions from cognition to communication are involved”.
A second corpus-based study of the modals in native English is Gabrielatos and
Sarmento (2006). This study illustrates an attempt to account for syntactic contextual
information while using a quantitative corpus-based approach to investigate core
English modals (i.e. can, could, may, might, must, shall, should, will and would).
Although their study does not compare English varieties, it does present
a comparative analysis of the frequencies of use of the modals in an aviation
corpus and a representative corpus of American English. Generally, it raises the
following questions:
– To what degree do syntactic structures and modal forms interact contextually?
– To what degree does such interaction affect investigated modal forms semantically?
– How can such interaction be quantitatively investigated in a corpus including
  cross-linguistic and interlanguage data?
The authors acknowledge that the modals’ distribution varies as a function of their
syntactic contexts and they show that frequencies of occurrence of core English
modals reflect the type of syntactic environment in which they feature: “there is a
great deal of variation in the use of modal verbs and the structures they occur in,
depending on the context of use” (p. 234). However, their lack of a suitable
cognitively motivated theoretical framework prevents them from providing a meaningful
interpretation of the data and from further exploring their findings.
To this date, Collins (2009: 1) presents:
the largest and most comprehensive [study] yet attempted in this area [modality]
based on an analysis of every token of the modals and quasi-modals (a total of
46,121) across the spoken and written data.
Collins (2009) investigates the meanings of the modals in three parallel corpora of
contemporary British English, American English and Australian English. Despite the
author’s recognition that a corpus quantitative approach “typically combined with a
commitment to the notion of ‘total accountability’ may influence hypotheses applied
to the data, or formulated on the basis of it” (p. 5) and despite the large size of his data
set, his analysis is of limited informative value due to:
– a theoretical framework that does not allow for the full exploitation of the
  linguistic context of the modals, and;
– a statistical approach that inhibits rather than unveils linguistic patterns at play in
  the data.
With regard to the rst point, Collins (2009) restricts his approach to the identica-
tion of the forms’ lexical meanings. His theoretical framework consists of a tradition-
al tripartite taxonomy including epistemic, deontic and dynamic senses. Regrettably,
while he recognizes that some uses of the modals can yield preferences for particular
syntactic environments, his analysis does not address that fact in a systematic quan-
titative fashion. As for the second point, while, statistically, Collins (2009) limits his
investigation to providing frequency tables of modal forms, his overall approach is
problematic because it is based on the erroneous assumption that the frequent occur-
rence of a modal form warrants its linguistic relevance. In the case of may and can, for
instance, Collins uses raw frequencies to show that deontic may is the “least common
sense of the three as it is chosen 7% of the time over epistemic may (79%) and dynam-
ic may (8.1%). However, he does not show whether the (low) frequency of deontic
may is signicantly dierent from the also low frequency of dynamic may, and our
analysis of his data shows that, excluding the indeterminate cases, the distribution of
may’s senses differs highly significantly across the American, Australian, and British data
(χ² = 42.68; df = 4; p < 0.001). This, in turn, raises the following questions:
– To what extent are Collins’ (2009) frequencies of the occurrences of modal forms
  in each corpus comparable?
– Since the observed frequency discrepancies are not a matter of chance, what
  motivates, linguistically, the different uses of each form in each independent corpus?
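To make the logic of the significance test reported above concrete, the following is a minimal sketch of a chi-square test of independence on a senses-by-variety table. With three senses and three varieties, df = (3 − 1) × (3 − 1) = 4, which matches the reported degrees of freedom. The counts in `toy_counts` are invented for illustration and are not Collins' (2009) figures; the function name `chi_square` is likewise an assumption of this sketch.

```python
# Chi-square test of independence, written out by hand.
# Rows: senses (epistemic, deontic, dynamic); columns: three varieties.
# NOTE: toy_counts is hypothetical data, not taken from Collins (2009).

def chi_square(table):
    """Return (statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if sense and variety were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df

toy_counts = [
    [120, 150, 130],  # epistemic (hypothetical)
    [20, 10, 35],     # deontic (hypothetical)
    [15, 30, 12],     # dynamic (hypothetical)
]

# Standard critical value of the chi-square distribution for df = 4, p = 0.001.
CRITICAL_DF4_P001 = 18.467

statistic, df = chi_square(toy_counts)
print(f"chi-square = {statistic:.2f}, df = {df}, "
      f"significant at p < 0.001: {statistic > CRITICAL_DF4_P001}")
```

A statistic of 42.68 at df = 4 lies well above the critical value of 18.467, which is why the reported result corresponds to p < 0.001.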
So in sum, while studies such as Gabrielatos and Sarmento (2006) and Collins (2009)
provide many descriptive results, they are often merely or largely form-based
and are lacking in terms of determining which of the many frequencies are
statistically and/or linguistically relevant. As a result, such studies do not come close to allowing
us to develop a characterization of modals that essentially allows us to classify/predict
modal use.
2.2.2 Learner English and contrastive approaches
From a cross-linguistic and an interlanguage perspective, investigating the modals
raises two related issues, namely (i) the possibility of a lack of (direct) semantic
equivalence between the modal forms in the learner’s native language (L1) and his/her
target language (L2), and (ii), the fact that such cross-linguistic semantic dissimilarity
will affect the uses of the forms in L2. The modals may and can and native French
pouvoir illustrate the case in point. Despite the fact that all three forms contribute to
the expression of the semantic notion of possibility, pouvoir synchronically covers
the whole range of the modal uses of may and can.
One corpus-based study of learners’ use of modals is Aijmer (2002), which is
based on a corpus of Swedish L2 English writers. She compares (i) the frequencies
of key modal words in native English and advanced Swedish-English interlanguage,
as well as (ii) frequencies encountered in Swedish learner English with those from
comparable French and German L2 English. Aijmer’s study indicates “a generalized
overuse of all the formal categories of modality” and she further points out that “it
is only at a functional level that any underuse was detected, with the learner writers
failing to use may at all in its root meaning” (p. 72).
Similarly, Ne et al. (2003) investigate the uses of modal verbs (can, could, may,
might and could) by writers from several L1 backgrounds. Ne et al. (2003) use a
learner corpus including Dutch-, French-, German-, Italian-, and Spanish-English
interlanguage, which they contrast with a reference corpus of American university
English. Ne et al. (2003: 215) identify the case of can as potentially interesting “since
it is overused by all non-native writers”. ey further report that the frequency of may
by French native speakers stands out in comparison to the frequencies by all other
non-native speakers included in the study, but since their study does basically nothing
but compare raw frequencies of occurrence regardless of any contextual features, it is
not particularly illuminating.
Generally, and similar to Gabrielatos and Sarmento (2006) and Collins (2009),
both Aijmer (2002) and Neff et al. (2003) made the disadvantageous methodological
decision to conveniently, but ultimately problematically, rely on information that is
retrievable without human effort. In addition, even the studies that address learner
use do not relate their findings to the wider context of (second) language acquisition.
In a corpus-based contrastive study, Salkie (2004) investigates the nature of the
semantic relations between the three forms in native English and native French. He
uses a subpart of the parallel corpus INTERSECT (cf. Salkie 2000), and focuses on
three working hypotheses, namely that:
– “pouvoir corresponds more closely to one of the English modals rather than the
other” (p. 169);
– “pouvoir is less specic than the English modals” (p. 170);
– “pouvoir has a sense which is dierent from both the English modals but is not
just a general sense of possibility” (p. 170).
While Salkie (2004) concludes in favour of the third hypothesis, it is worth point-
ing out, however, that his results were based on only 100 randomly extracted occur-
rences of each English modal form (i.e. may and can) and their respective French
By way of a more general summary, it is probably fair to say that corpus-based
approaches to modality in L1 and L2s leave things to be desired. Some studies point
to the immense complexity of the subject but do not choose multifactorial or multi-
variate methods that are capable of addressing this degree of complexity. In addition,
some studies are based on large numbers of modals but, frankly, do not do very much
with the vast amount of data other than present arrays of statistically under-analyzed
frequency tables. On the other hand, the analytically much more interesting studies
of the kind of Salkie (2004) are based on very small samples. Finally, many studies are
largely if not exclusively form-based and focus only on learners’ over-/underuse of
modals in particular examples or kinds of contexts.
2.3 Characteristics of the present study
2.3.1 Methodological considerations
The above discussion fairly clearly indicates what kind of approach would be desirable,
namely one that:
– can integrate linguistic information and patterning from many different levels of
  linguistic analysis in a way alluded to by Hermerén (1978), as well as Klinge and
  Müller (2005);
– involves a sample that is not only studied with regard to more linguistic
  parameters but is also larger than those of previous studies that aimed at more
  than description;
– explores similarities and differences of L1 uses of can and may, but also explores
  the way these English modals are used in the L2 (here by French learners)
  as well as how the same concept is expressed by the learners in their L1 (here pouvoir).
Given these demands, we decided to use the so-called Behavioral Profile approach,
which fits the above wish list very well. It combines the statistical methods of
contemporary quantitative corpus linguistics with a cognitive-linguistic and
psycholinguistic perspective or orientation (cf. Divjak and Gries 2006, 2008, 2009; Gries 2006,
2010b; Gries and Divjak 2009, 2010; and others). As such, it diverges radically from
the above-mentioned more traditional corpus-based approaches to modality in both
L1 and L2. Methodologically, it involves four steps:
– the retrieval of all instances of a word’s lemma from a corpus in their context;
– a manual annotation of a number of features characteristic of the use of the word
  forms in the data; these features are referred to as ID tags and typically involve
  morphosyntactic and semantic features in particular. Each ID tag contributes to
  the profiling of the investigated lexical item(s);
– the generation of a table of co-occurrence percentages, which specify, for
  example, which words (from a set of near-synonymous words) or senses (of a
  polysemous word) co-occur with which morphosyntactic and/or semantic ID tags; it is
  these vectors of percentages that are called profiles;
– the evaluation of that table by means of statistical techniques.
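The percentage-table step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the scripts actually used: the rows are toy data loosely modelled on the kind of annotation shown in Table 1, and the key names (`form`, `clause_type`, `negation`) and the function name `behavioral_profiles` are hypothetical.

```python
from collections import Counter, defaultdict

def behavioral_profiles(rows, form_key, variables):
    """For each form, return the proportion of each (variable, level) ID tag.

    Within one variable, the proportions for a given form sum to 1.
    """
    tag_counts = defaultdict(Counter)  # form -> Counter of (variable, level)
    form_totals = Counter()            # form -> number of annotated examples
    for row in rows:
        form = row[form_key]
        form_totals[form] += 1
        for variable in variables:
            tag_counts[form][(variable, row[variable])] += 1
    return {
        form: {tag: count / form_totals[form] for tag, count in tags.items()}
        for form, tags in tag_counts.items()
    }

# Hypothetical annotated examples (one dict per corpus hit).
toy_rows = [
    {"form": "may", "clause_type": "coordinate", "negation": "affirmative"},
    {"form": "may", "clause_type": "main", "negation": "affirmative"},
    {"form": "may", "clause_type": "main", "negation": "negative"},
    {"form": "can", "clause_type": "coordinate", "negation": "affirmative"},
    {"form": "can", "clause_type": "subordinate", "negation": "negative"},
]

profiles = behavioral_profiles(toy_rows, "form", ["clause_type", "negation"])
for form, profile in profiles.items():
    print(form, {tag: round(share, 2) for tag, share in sorted(profile.items())})
```

Each form's dictionary of proportions is one profile; stacking the profiles of all forms yields the table that is then evaluated statistically.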
Given how this approach is completely based on various kinds of co-occurrence
information, it comes as no surprise that, just like much other work in corpus
linguistics, the BP approach assumes that “the distributional characteristics of the use of an
item reveals many of its semantic and functional properties and purposes” (Gries and
Otani 2010: 3). While these previous studies have investigated a variety of different
lexical relations (near synonymy, polysemy, antonymy) both within languages
(English, Finnish, Russian) and across languages (English and Russian), the present study
will add to the domains in which Behavioral Profiles have been used in two ways:
(i) so far, no non-native language data have been studied, and (ii) we will add French
to the list of languages studied.
As the rst BP study focusing on learner data, and only the second BP study that
compares data from dierent languages, this paper is still largely exploratory. We will
mainly be concerned with the following two issues:
– To what degree can Behavioral Profiling handle the kind of learner data that
  are inherently more messy and volatile than native data and provide a
  quantitatively adequate and fine-grained characterization of the use of can and may by
  native speakers and learners, and how does that use compare to French
  speakers’ use of pouvoir?
– As a follow-up, and if meaningful groups of uses emerge, to what degree do the
  distributional characteristics that BP studies typically include allow us to predict
  native speakers’ and learners’ choices of modal verbs, and how do these speaker
  groups differ?
The former question will be explored with the kind of cluster-analytic approach
usually employed in BP studies; for the latter question, we will turn to a logistic regression
(cf. Arppe 2008 for another BP approach using (multinomial) regression).
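As an illustration of what a cluster-analytic evaluation of profile vectors involves, the following is a minimal sketch of agglomerative (bottom-up) hierarchical clustering. Euclidean distance and average linkage are assumptions chosen here for simplicity, not necessarily the settings used in BP studies, and the profile vectors and group labels are invented toy values.

```python
from itertools import combinations

def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def average_linkage(profiles):
    """Repeatedly merge the two closest clusters; return the merge history.

    `profiles` maps a label to its profile vector. Each history entry is
    (cluster_a, cluster_b, average distance at which they were merged).
    """
    clusters = {label: [label] for label in profiles}  # cluster name -> members
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            pairs = [(x, y) for x in clusters[a] for y in clusters[b]]
            dist = sum(euclidean(profiles[x], profiles[y]) for x, y in pairs) / len(pairs)
            if best is None or dist < best[0]:
                best = (dist, a, b)
        dist, a, b = best
        clusters[f"({a}+{b})"] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, dist))
    return merges

# Hypothetical profile vectors (proportions of three ID tags per group).
toy_profiles = {
    "can_native": [0.70, 0.20, 0.10],
    "can_learner": [0.65, 0.25, 0.10],
    "may_native": [0.30, 0.30, 0.40],
    "may_learner": [0.35, 0.30, 0.35],
    "pouvoir": [0.55, 0.15, 0.30],
}

for a, b, dist in average_linkage(toy_profiles):
    print(f"merge {a} and {b} at distance {dist:.3f}")
```

The merge history is the information a dendrogram displays: groups merged early and at small distances have similar distributional profiles.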
2.3.2 Theoretical orientation
In previous studies, the BP approach was used for more than just the quantitative
description of the data. Rather, it is firmly grounded in usage-based/exemplar-based
approaches within Cognitive Linguistics and psycholinguistics, and it attempts to relate
the results of the statistical exploration of the data to them. While this orientation
is also compatible with our current goals, there is one particular earlier model in
L2/FLA research that is especially well-suited to, or compatible with, our current
objectives, namely the Competition Model (CM) by Bates and MacWhinney (cf. Bates and
MacWhinney 1982, 1989). This model is “a probabilistic theory of grammatical
processing which developed out of a large body of crosslinguistic work in adult and child
language, as well as in aphasia” (Kilborn and Ito 1989: 261). MacWhinney (2004: 3)
himself characterized it as a “unified model [of language acquisition] in which the
mechanisms of L1 learning are seen as a subset of the mechanisms of L2 learning”.
The CM is characterized by the two following assumptions:
– Linguistic signs map forms and functions onto each other (probabilistically) such
  that forms and functions are cues to functions and forms respectively.
– In language production, forms compete to express underlying intentions or
  functions, and in language comprehension, the input contains many different cues of
  different strengths, validities, and reliabilities, which must be integrated: native
  speakers “depend on a particular set of probabilistic cues to assign formal surface
  devices in their language to a specific set of underlying functions” (Bates and
  MacWhinney 1989: 257).
As a usage-based and probabilistic model, the CM assumes that both frequency and
function determine the choice of grammatical forms in language production; as with
most usage-based and/or corpus-linguistic approaches, we too consider frequency in
a corpus as a proxy for frequency of exposure (in both comprehension and
production). Cross-linguistically, this is an important assumption because across languages
cues are instantiated in different ways and speakers assign them varying degrees of
strength. It is therefore important to describe and explain L1 statistical regularities as
“[t]hey are part of the native speaker’s knowledge of his/her language, and they are an
important source of information for the language learner” (Bates and MacWhinney
1989: 15).
Overall, Kilborn and Ito (1989: 289) conclude that existing psycholinguistic
studies have successfully demonstrated that the CM is appropriate for the characterization
of learner language through cue distributions and they report “extensive evidence for
the invasion of L1 strategies into L2 processing”. In addition, it is also obvious how
compatible the CM is with a BP approach. The main notions that drive the
Competition Model are cue strengths, validities, and reliabilities, and all of these are
essentially conditional probabilities, i.e. percentages. While the BP approach as such
does not cover the full complexity of how conditional cue strengths, validities, and
reliabilities can interact, it is a useful and experimentally validated (cf. Divjak and
Gries 2008) approach employing a similar logic.
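The point that such quantities are conditional probabilities can be illustrated with a small sketch. From the same co-occurrence counts one can compute both P(form | cue), i.e. how strongly a contextual feature predicts a form, and P(cue | form), i.e. the kind of percentage that enters a behavioral profile. The counts are invented, the function name `conditional_probabilities` is hypothetical, and no claim is made here about which of the two corresponds to which technical term of the Competition Model.

```python
from collections import Counter

def conditional_probabilities(pairs):
    """Return (P(form | cue), P(cue | form)) from observed (cue, form) pairs.

    Both results are dictionaries keyed by (cue, form).
    """
    joint = Counter(pairs)
    cue_totals = Counter(cue for cue, _ in pairs)
    form_totals = Counter(form for _, form in pairs)
    p_form_given_cue = {(c, f): n / cue_totals[c] for (c, f), n in joint.items()}
    p_cue_given_form = {(c, f): n / form_totals[f] for (c, f), n in joint.items()}
    return p_form_given_cue, p_cue_given_form

# Hypothetical observations: (negation cue, modal form chosen).
toy_pairs = (
    [("negated", "can")] * 6
    + [("negated", "may")] * 2
    + [("affirmative", "can")] * 4
    + [("affirmative", "may")] * 8
)

p_form_given_cue, p_cue_given_form = conditional_probabilities(toy_pairs)
print("P(can | negated) =", p_form_given_cue[("negated", "can")])
print("P(negated | can) =", p_cue_given_form[("negated", "can")])
```

The two directions generally differ, which is one reason why a single table of percentages cannot capture every way in which cues and forms can interact.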
A theory of language transfer requires that we have some ability to predict where
the phenomena in question will and will not occur. In this regard contrastive
analysis alone falls short; it is simply not predictive. (Gass 1996: 324)
3. Data and methods
3.1 Retrieval and annotation
The data are from three untagged corpora: the French subsection of the International
Corpus of Learner English (henceforth ICLE-FR), the Louvain Corpus of Native English
Essays (LOCNESS), and the Corpus de Dissertations Françaises (CODIF). All corpora
included in the present work were collected by the Centre for English Corpus
Linguistics (CECL) at the Université Catholique de Louvain (UCL) and made available
to us by the Director of the Centre, Professor Sylviane Granger. ICLE-FR has a total
of 228,081 words, including 177,963 words of argumentative texts and 50,118 words
of literary texts. LOCNESS is a 324,304-word corpus that includes three sub-data
sets: a 60,209-word sub-corpus of British A-Level essays, a 95,695-word sub-corpus
of British university essays and a sub-corpus of American university essays that has
168,400 words. The CODIF is a corpus of essays written by French-speaking
undergraduate students in Romance languages at the Université Catholique de
Louvain (UCL). CODIF also includes argumentative and literary texts and has a total of
2. Information on the total number of words featuring in each individual text type (i.e.
argumentative, literary) is not available.
Given their composition, the three corpora included in our study are
highly comparable. They all consist of written data produced by university students
(ICLE, CODIF, the LOCNESS British and American university sections) or by
students approaching university entrance (i.e. the LOCNESS British A-Level section).3
All participants’ contributions are in the form of an essay approximately 500 words
long. In terms of content, all essays deal with similar topics such as crime, education,
the Gulf War, Europe, or university degrees.
The data we subjected to the BP approach consist of instances of may and can in
native English and French-English interlanguage as well as pouvoir in native French
from the above corpora. Using scripts written in R (cf. R Development Core Team
2010), we retrieved 3,710 occurrences of the investigated modal forms from all
sub-corpora, which were imported into a spreadsheet software and annotated for 22
morphosyntactic and semantic variables.4 Table 1 exemplifies this database with a
very small excerpt of these data, and Table 2 presents the total range of variables
included in the study and their respective levels.
For each variable, an encoding taxonomy was designed prior to annotation. Due
to the large number of variables included in this study and the absence of a number of
them from previous studies on the English modals, not all encoding taxonomies were
theoretically motivated. In cases where the annotation is not based on accounts from
the existing literature, a bottom-up approach was adopted for the identification of
recurrent features in the data. This procedure, for instance, was carried out in the case of
the variable VS where, prior to annotation, recurrent semantic features
were identified as characteristic of the lexical verbs used alongside the modals.
3. The inclusion of the LOCNESS British A-Level section alongside sub-corpora solely
including university participants is not judged problematic as LOCNESS only involves English
native speakers whose level of English is not expected to develop any further.
4. Although the annotation process included a variable encoding the semantic role of the
subject referent of the modals, this study does not account for that variable due to its high
correlation with VOICE.
Table 1. Excerpt of an annotation table including selected variables
C     M        C       CT           U        VS                  N            RA
5     may      native  coordinate   process  ment/cog/emotional  affirmative  animate
133   may      native  main         state    copula              affirmative  inanimate
1760  may      native  main         process  ment/cog/emotional  negative     animate
1886  can      il      coordinate   process  ment/cog/emotional  affirmative  animate
2876  cannot   il      subordinate  state    abstract            negative     inanimate
3540  peut     fr      main         process  ment/cog/emotional  negative     animate
3645  peuvent  fr      subordinate  process  abstract            negative     inanimate
Because of space restrictions, we are not able to provide a more comprehensive
account of the annotation process (but cf. Deshors 2010 for details). However, three
variables – S, VT, and VS – require some brief explanation.
The variable S
As for S, the semantic category of modality includes a wide range of heterogeneous
meanings that many scholars have attempted to unite under a variety of
categorization systems (cf. Palmer 1979; Coates 1983; Bybee and Fleischman 1995;
Huddleston 2002; Nuyts 2006; Byloo 2009). While Depraetere and Reed (2006: 277)
note that “in classifying modal meanings, it is possible to use various parameters as
criterial to their classification”, this study assumes a coding taxonomy based on a
traditional tripartite distinction between epistemic, deontic and dynamic meanings.
Table 2. Overview of the variables used in the study and their respective levels
Type           Variable                           Levels
data           C                                  native, interlanguage, French
               GA (acceptability)                 yes, no
syntactic      N (negation)                       affirmative, negated
               ST (sentence type)                 declarative, interrogative
               CT (clause type)                   main, coordinate, subordinate
morphological  F                                  can, may, pouvoir (and negated forms)
               SM: subject                        adj., adv., common noun, proper noun, relative pronoun, date, noun phrase, etc.
               SP: subject person                 1, 2, 3
               SN: subject number                 singular, plural
               V                                  active, passive
               A                                  perfect, perfective, progressive
               M                                  indicative, subjunctive
               SRN: subject referent number       singular, plural
semantic       S                                  epistemic, deontic, dynamic
               SP                                 weak, medium, strong
               U                                  accomplishment, achievement, process, state
               VS                                 abstract, general action, action incurring transformation, action incurring movement, perception, etc.
               RA: subject referent               animate, inanimate
               AT: subject referent animacy type  animate, floral, object, place/time, mental/emotional, etc.
Following Nuyts (2006: 6), epistemic senses concern “an indication of the epistemic
estimation, typically, but not necessarily, by the speaker, of the chances that the state
of affairs expressed in the clause applies in the world”. Consider (1) as an illustration
of epistemic may:
(1) indeed, Europe 92 may lead to the disappearance of cultural differences
Following Palmer (1979: 58), deontic modality refers to cases where “[b]y uttering a
modal, a speaker may actually give permission (may, can)”. (2) illustrates deontic can:
(2) if all public schools started to say you can only come here if you are Hispanic
or if you are Polish, our schooling system would be in great chaos
Finally, dynamic meanings denote “an ascription of a capacity to the subject-participant
of the clause (the subject is able to perform the action expressed by the main
verb in the clause)” (Nuyts 2006: 3). Generally, dynamic modality expresses the potentiality
of an event occurring. Nuyts’s type of dynamic modality includes ability/capability
cases where the possibility of event occurrence stems from the ability of the
(grammatical) subject to carry out the event. In that regard, the term ability is not restricted
to a ‘physical’ interpretation and equally applies to mental and technical types
of ability. Example (3) illustrates dynamic can:
(3) Mrs Ramsay is the central character because she can see the whole personality
of the other ones
Generally, our frequencies of use of may and can in their different senses match those
previously encountered in existing studies solely concerned with the native use of
the modals, such as Coates (1980) and Collins (2009). While Coates (1980: 218), for
instance, reports that “by far the most common usage of may is to express epistemic
possibility”, she stresses the distinctive nature of the uses of may and can:
The patterns resulting from my analysis of the data (…) leads me to conclude
that in normal everyday usage may and can express distinct meanings: may is
primarily used to express epistemic possibility, while can primarily expresses root
.. e variable VT
e variable VT targets the lexical verbs with which the forms are used and
characterizes their telicity. Conceptually, the variable VT follows Vendler
(1967) in its recognition that the notion of time is crucially related to the use of a
. Coates (1980, 1983) categorizes modal meaning according to a two-way distinction that
includes epistemic and non-epistemic modality. She refers to the latter type as “root” modality.
 Sandra C. Deshors and Stefan . Gries
verb and is “at least important enough to warrant separate treatment” (p. 143). is
variable assesses:
– whether may and can have preferences for lexical verbs denoting a state, a process,
an accomplishment or an achievement,6 and if so,
it identies in which type of corpus preferential patterns occur.
.. e variable VS
Similarly to the variable VT, VS identies the type of semantic
information conveyed by the lexical verbs used with the modals. e internal organ-
ization of this variable results from a bottom-up approach and does not follow any
particular theoretical framework. is variable consists of the levels denoting abstract
process, physical actions, actions incurring movement, actions incurring some physi-
cal transformation, communicative processes, mental/cognitive/emotional processes,
perception processes and verbal statement involving a copula verb. Example (4) illus-
trates a case where the lexical verb expresses a mental/cognitive/emotional process:
(4) Her search for the nal touch can be seen as a search for harmony
Once all matches were annotated, the resulting data table was evaluated statistically.
6. Accomplishment verbs encode verbal statements that imply a unique and definite time
period; achievement verbs encode verbal statements that imply a unique and definite time
instant; process verbs identify statements that reflect non-unique and indefinite time periods;
state verbs identify statements that reflect non-unique and indefinite time instants.
7. All statistical computations and plots were performed with R (for Linux), version 2.11.0
(see R Development Core Team 2010).
The BP approach in this study: Statistical analysis
As mentioned above, the data were evaluated in two different ways.7 The first of these
involved the type of cluster analysis that is characteristic of much work using the BP
methodology. In this first part, we used Gries’s (2010a) R script Behavioral Profiles
1.01 and computed five behavioral profiles, one for each modal form as occurring in
each language variety, i.e. native can, native may, interlanguage (IL) can, IL may, and
native pouvoir (FR). Such profiles consist of vectors of co-occurrence percentages of a
single modal form with each level of all independent variables and provide form-specific
summaries of their semantic and morphosyntactic behavior in each sub-corpus.
In a second step, the profiles were assessed statistically with a hierarchical cluster
analysis to explore the similarities and differences between the modal forms, and in
keeping with previous studies (cf. Divjak and Gries 2006), we chose the Canberra
metric as a measure of (dis)similarity and Ward’s rule as an amalgamation strategy.
Following Gries and Otani (2010), we computed different cluster analyses, one involving
all variables that the uses of the modals were annotated for, one for only the
syntactic variables, and one for only the semantic variables.
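To make the notion of a behavioral profile more concrete, the following is a minimal sketch in Python (not the R script used in the study) of how a profile vector and the Canberra dissimilarity between two profiles could be computed. The toy annotations, the function names and the use of proportions instead of percentages are assumptions made for illustration only; the Ward amalgamation step is not reproduced here.

```python
from collections import Counter


def behavioral_profile(rows, variables):
    """Return {(variable, level): proportion} for the uses of one modal form."""
    profile = {}
    for var in variables:
        counts = Counter(row[var] for row in rows)
        total = sum(counts.values())
        for level, n in counts.items():
            profile[(var, level)] = n / total
    return profile


def canberra(p, q):
    """Canberra dissimilarity between two profiles; 0/0 terms are skipped."""
    dist = 0.0
    for key in set(p) | set(q):
        a, b = p.get(key, 0.0), q.get(key, 0.0)
        if a + b > 0:
            dist += abs(a - b) / (a + b)
    return dist


# Hypothetical toy annotations (NOT the study's data).
native_may = [
    {"CT": "main", "N": "affirmative"},
    {"CT": "main", "N": "negated"},
    {"CT": "subordinate", "N": "affirmative"},
    {"CT": "subordinate", "N": "affirmative"},
]
il_may = [
    {"CT": "main", "N": "affirmative"},
    {"CT": "main", "N": "affirmative"},
    {"CT": "main", "N": "affirmative"},
    {"CT": "subordinate", "N": "affirmative"},
]

variables = ["CT", "N"]
bp_native = behavioral_profile(native_may, variables)
bp_il = behavioral_profile(il_may, variables)
distance = canberra(bp_native, bp_il)
```

In this sketch, a level that is unattested for one form simply counts as a proportion of zero, so that a level used by only one of the two forms contributes the maximal term of 1 to the dissimilarity.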
The second analytical step involved a binary logistic regression including the
following variables and predictors:
– F as the dependent variable with only two levels here: can vs. may;
– GA, N, ST, CT, SM, SP, SN, V, A, M, SRN, S, SP, U, VS, RA, AT as independent
variables in the form of main effects;
– all these variables’ interactions with C as additional predictors (to see
which variables’ influence on modal use differs the most between L1 English and
L2 English).
The logistic regression was then performed with a model selection process during
which insignificant predictors were discarded from the model: first insignificant
interactions, then individual variables that were not significant and did not participate
in a significant interaction.
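The order of deletions in such a model selection process can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the procedure as implemented in the study: the function names, the three-predictor subset, the significance threshold of 0.05 and all p-values are hypothetical, and in an actual analysis the p-values would be recomputed by refitting the model after every deletion.

```python
def initial_terms(predictors, corpus="C"):
    """Main effects plus each predictor's interaction with the corpus variable."""
    mains = [corpus] + list(predictors)
    interactions = [f"{corpus}:{p}" for p in predictors]
    return mains + interactions


def next_term_to_drop(terms, p_values, alpha=0.05):
    """Return the next deletion candidate, or None if nothing can be dropped.

    Interactions are considered first; a main effect is only eligible when it
    is non-significant and takes part in no interaction still in the model.
    """
    interactions = [t for t in terms if ":" in t]
    weak = [t for t in interactions if p_values[t] >= alpha]
    if weak:
        return max(weak, key=lambda t: p_values[t])
    protected = {part for t in interactions for part in t.split(":")}
    weak = [
        t for t in terms
        if ":" not in t and t not in protected and p_values[t] >= alpha
    ]
    return max(weak, key=lambda t: p_values[t]) if weak else None


terms = initial_terms(["N", "SN", "ST"])
# Hypothetical p-values (NOT the study's results).
p_values = {
    "C": 0.001, "N": 0.001, "SN": 0.25, "ST": 0.001,
    "C:N": 0.001, "C:SN": 0.001, "C:ST": 0.40,
}
first = next_term_to_drop(terms, p_values)
terms.remove(first)
second = next_term_to_drop(terms, p_values)
```

With these invented values, the interaction C:ST is removed first, whereas the non-significant main effect SN stays in the model because it participates in the significant interaction C:SN.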
Results and discussion
Cluster analysis
Our first cluster analysis yielded the results shown in Figure 1. The left plot is a dendrogram
of the five modal forms that were clustered; the right plot represents average
silhouette widths for assuming two, three, and four clusters. The average silhouette
widths point to a two-cluster solution, maybe a three-cluster solution, but the difference
is minor since the former would result in a French-vs.-English clustering, and
the latter in a French-vs.-can-vs.-may clustering. This is compatible with the analysis of
Salkie, who argued that pouvoir is very different from both can and may, and intuitively,
both these solutions “make sense”, which provides first evidence in favor of the approach.
To anticipate the potential objection that this may seem trivial, let us mention
that it is in fact not. The data in Figure 1 show that the BP vectors are good and robust
descriptors of how the modals behave because many other theoretically possible cluster
solutions, such as the ones listed in (5), would not have made linguistic sense at all.
(5) a. {{{can_il may_native pouvoir} can_native} may_il}
b. {{{can_native may_il pouvoir} can_il} may_native}
c. {{can_il may_native} {pouvoir may_il} can_native}
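For readers unfamiliar with silhouette widths, the following Python sketch shows one way of computing the average silhouette width of a cluster solution from a dissimilarity matrix. The distances are invented for illustration and are not the values underlying Figure 1; the convention of assigning a width of 0 to singleton clusters is an assumption of this sketch, and the function presupposes at least two clusters and no pair of identical points.

```python
def average_silhouette(dist, labels):
    """Mean silhouette width of a clustering, given a symmetric distance matrix."""
    n = len(labels)
    widths = []
    for i in range(n):
        own = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not own:
            widths.append(0.0)  # convention used here for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance to the item's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == other)
            / labels.count(other)
            for other in set(labels)
            if other != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / n


# Order of forms: can_native, can_il, may_native, may_il, pouvoir.
# Hypothetical dissimilarities (NOT the study's data).
dist = [
    [0.0, 1.0, 1.5, 1.5, 6.0],
    [1.0, 0.0, 1.5, 1.5, 6.0],
    [1.5, 1.5, 0.0, 1.0, 6.0],
    [1.5, 1.5, 1.0, 0.0, 6.0],
    [6.0, 6.0, 6.0, 6.0, 0.0],
]
two_clusters = [0, 0, 0, 0, 1]    # English vs. French
three_clusters = [0, 0, 1, 1, 2]  # can vs. may vs. pouvoir

width_two = average_silhouette(dist, two_clusters)
width_three = average_silhouette(dist, three_clusters)
```

In this toy configuration the two-cluster solution receives the higher average silhouette width, because the distance between the English forms and pouvoir dwarfs the distance between can and may.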
However, in what follows we show that a fine-grained comparative description of
cross-linguistic language varieties can be obtained by focusing on differences between
the independent variables used for clustering. Consider Figure 2, which shows the
dendrograms for all morphosyntactic variables and all the semantic variables in the
left and right panel, respectively.
Figure 1. Dendrogram for all independent variables (il = interlanguage). Right panel:
(average) silhouette widths by number of clusters in the solution (plotted values: 0.1, 0.07, 0.03)
Figure 2. Dendrograms for all morphosyntactic variables (left panel) and all semantic
variables (right)
Interestingly, the results show that the intuitively very reasonable dendrogram in
Figure 1 is not replicated by looking at morphosyntax or semantics alone, which to
some extent at least contrasts with Gries and Otani’s results, where the results did not
differ very much between the three clusterings. The reasonable similarities of Figure 1
emerge only when all variables are combined. In particular, in both panels of Figure 2
can_il and can_native are grouped together, but then the remaining forms are grouped
differently. In the morphosyntactic dendrogram, the two kinds of may are successively
amalgamated and the French pouvoir is only added after all English forms have been
clustered. In other words, morphosyntactically, we find a clear English-French divide,
but interlanguage may is too different from native may to be grouped together. To
identify the source of this difference, we used what in BP approaches has been called
a snakeplot, namely a plot of the pairwise differences between the percentages for, in
this case, may_il and may_native (cf. Divjak and Gries 2009 or Gries and Otani 2010 for
more examples).
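The computation behind such a snakeplot can be sketched in a few lines of Python. The sketch assumes that each profile is stored as a mapping from ID-tag levels to proportions; the level names and all values are hypothetical and serve only to show how the sorted pairwise differences are obtained.

```python
def profile_differences(profile_a, profile_b):
    """Sorted (difference, tag) pairs: profile_a minus profile_b for each level."""
    tags = set(profile_a) | set(profile_b)
    diffs = [(profile_a.get(t, 0.0) - profile_b.get(t, 0.0), t) for t in tags]
    return sorted(diffs)


# Hypothetical proportions (NOT the study's data).
may_il = {
    "CT: main": 0.60, "CT: subordinate": 0.30, "CT: coordinate": 0.10,
    "N: affirmative": 0.95, "N: negated": 0.05,
}
may_native = {
    "CT: main": 0.45, "CT: subordinate": 0.45, "CT: coordinate": 0.10,
    "N: affirmative": 0.90, "N: negated": 0.10,
}

diffs = profile_differences(may_il, may_native)
most_underused = diffs[0]   # largest negative difference (learner underuse)
most_overused = diffs[-1]   # largest positive difference (learner overuse)
```

Plotting the sorted differences from left to right yields the characteristic S-shaped curve, with the levels of interest at the two extremes and the many near-zero differences in the middle.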
As indicated in Figure 3, the main morphosyntactic ways in which learners deviate
from native speakers are that learners underuse may in subordinate clauses and
in negated clauses. This is in fact an interesting finding because it means that learners
disprefer the rarer of the two modals – may – in those contexts which are already
morphosyntactically more challenging, as if using can is the default they resort to
when they are already under a higher processing load (cf. the so-called complexity
principle).
In the semantic dendrogram, by contrast, we find a different patterning. Semantically,
can_il and can_native are again very similar and grouped together early, but then
the next clustering step groups the two forms of may together. However, interestingly,
it is not the English forms that are then all grouped together – rather, contrary to
Salkie’s earlier analysis, pouvoir is semantically more similar to can than may is.
Figure 3. Snakeplot for most extreme differences between syntactic ID tags of may
(axis: syntactic differences, may interlang - may native; labelled extremes: subord clauses,
main clauses; remaining tags: several tiny or no differences)
Logistic regression
The model selection process involved thirteen steps during which insignificant predictors
were discarded. The final and minimally adequate model includes 16 significant
variables and 6 significant interactions and returned a highly significant correlation:
log-likelihood chi-square = 3296.47; df = 60; p < 0.001; the correlation between the
observed forms – may vs. can – and predicted probabilities is very high: R2 = 0.955.
Correspondingly, the model’s classificatory power was found to be very high, with
a classification accuracy of 99%. Table 3 summarizes all the significant variables and
interactions yielded in the final model.
Table 3. Overview of the results of the final GLM model
Predictor  Chi-square (df)    Predictor  Chi-square (df)
C          24.9 (1) ***       AT         98.2 (11) ***
GA         13.8 (1) ***       V          55.0 (1) ***
U          67.9 (1) ***       ST         47.2 (1) ***
E          100.0 (2) ***      N          87.2 (1) ***
CT         10.9 (1) ***       SP         29905.9 (2) ***
VT         97.4 (2) ***       C:CT       60.0 (2) ***
VS         384.9 (6) ***      C:VS       32.2 (6) ***
SP         26.6 (2) ***       C:SN       37.4 (1) ***
SN         1.3 (1) ns         C:RA       122.2 (1) ***
SM         49.1 (4) ***       C:AT       118.2 (11) ***
RA         59.2 (1) ***       C:N        12.0 (1) ***
Overall, the final model includes one significant interaction involving a morphological
variable (out of seven morphological variables), two significant interactions
involving syntactic variables (out of three syntactic variables) and three significant
interactions involving semantic variables (out of eight semantic variables). But what do
the interactions reflect? Let us begin with C:CT, as represented in Figure 4.
Figure 4. Bar plots of relative frequencies of C:CT (clause types shown: coordinate, main)
The frequencies of may and can differ with regard to the type of clauses in which
they occur in native and learner English. The (weak!) effect is that, in interlanguage
English, can is more strongly preferred over may in main clauses than it is in native
English.
While, as previously noted, existing literature concerned with the native use of
the modals commonly recognizes negation as “an important aspect of modal meaning”
(Hermerén 1978), our study not only confirms the need to include negation in
an investigation of the uses of the modals but further recognizes its significance as a
morphological criterion to assess interlanguage (dis)similarity. Consider Figure 5 for
the interaction C:N.
Figure 5 shows that, while all speakers prefer to use can in negated clauses, the
interlanguage speakers do so more strongly. This result does not come as a surprise: On
the one hand, this is also compatible with the complexity principle – negated clauses
are more complex and preferred with the more frequent modal. On the other hand,
where epistemic may not would be used in English, French speakers would tend to
use a lexical verb along with the adverb peut-être to indicate the speaker’s uncertainty,
as illustrated in (6):
(6) a. This may not be the case
b. Ce n’est peut-être pas le cas
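What an interaction such as C:N amounts to can be illustrated with a deliberately simplified Python sketch. In a logistic regression that contains only two binary, dummy-coded predictors and their interaction, the interaction coefficient equals the difference between the two corpora in the log odds ratio of can in negated vs. affirmative clauses. The model reported here contains many more predictors, so this is only an approximation of the logic; all counts below are hypothetical and not taken from the study.

```python
import math

# Hypothetical counts of (can, may) per corpus and polarity (NOT the study's data).
counts = {
    ("native", "affirmative"): (600, 400),
    ("native", "negated"): (90, 30),
    ("il", "affirmative"): (700, 300),
    ("il", "negated"): (95, 5),
}


def log_odds_can(n_can, n_may):
    """Log odds of can as opposed to may."""
    return math.log(n_can / n_may)


def negation_effect(corpus):
    """Log odds ratio of can in negated vs. affirmative clauses in one corpus."""
    return (log_odds_can(*counts[(corpus, "negated")])
            - log_odds_can(*counts[(corpus, "affirmative")]))


effect_native = negation_effect("native")
effect_il = negation_effect("il")
# A positive value means negation favors can more strongly in the learner data.
interaction = effect_il - effect_native
```

With these invented counts both groups prefer can under negation, but the learners do so more strongly, which is the kind of pattern described for Figure 5.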
Consider Figure 6 for the interaction C:SN.
While native speakers use can more often with singular subjects than with plural
subjects, it is the other way round with the learners, again a result compatible with the
complexity principle.
While the native speakers’ choices of may and can do not vary much between
animate and inanimate subjects, the learners’ choices do: with animate subjects, they
prefer can much more strongly. Figure 7 represents the interaction C:RA.
Figure 5. Bar plots of relative frequencies of C:N
Figure 6. Bar plots of relative frequencies of C:SN
Figure 7. Bar plots of relative frequencies of C:RA
Consider Figure 8 for the interaction C:VS; the upper panel
represents the interlanguage data, the lower panel represents the native speaker data,
and the bars are sorted from large absolute pairwise differences (left) to small absolute
pairwise differences (right).
The learners and the native speakers differ most strongly with semantically more
abstract verbs and time/place verbs, as in He thinks that if he can achieve one impossible
act, then this will change everything.
The learners prefer can with abstract verbs more strongly than the native speakers,
but they prefer may more strongly with time/place verbs. However, there are also
(less pronounced) differences for verbs that would typically have a human agent.
For instance, the learners prefer may with communication verbs and can with
action-transformation verbs. Virtually no difference at all is found with copulas.
As for the final interaction, C:AT, we do not represent it here
graphically. While it is significant, the large number of categories plus the fact that the
most pronounced differences occur with a small number of very infrequent categories
means that it does not yield much in terms of interesting findings.
Figure 8. Bar plots of relative frequencies of C:VS (categories shown in both panels:
templ, comm, act_transf, act_gen/mot, ment/perc, copula)
As for the main effects, we will not discuss them here in detail. This is because
these main effects by definition do not tell us anything about how the use of can and may
differs across language varieties (since these variables do not interact with C). However,
since they do tell us something about which modal verb is preferred by both native
speakers and learners, we summarize them here visually in Figure 9. The x-axis lists
the main effects, on the y-axis we show the percentage of can obtained for levels of
these main effects, and then the levels are plotted at their observed percentage of can;
the dashed line represents the overall percentage of can in the data.
Finally, a brief look at the regression’s misclassifications seems to indicate that
they did not occur randomly. All 34 misclassifications occurred in the interlanguage
data, and 29 of them occurred with may in a form characteristic only of the
French-English learner language. In the large majority of those misclassifications,
may is found to express a possibility that results from some sort of theoretical demonstration.
Consider the examples in (7) and (8). While the ones in (7) illustrate our
current point, (8) provides an additional example of an atypical occurrence of learner
may, which clearly denotes a strong sense of possibility and whose interpretation is
heavily reminiscent of that of can.
(7) a. So we may say that …
b. To conclude, we may say that …
c. As a conclusion, we may say that …
d. This is why we may now speak of the stupefying effect
e. This is the reason why we may say that …
(8) “Dresden is an old town, we may read of its history
Figure 9. Main effects of the logistic regression (y-axis: percentage of can)
Concluding remarks
By way of a summary, the BP approach and the subsequent logistic regression allow
us to recognize how can and may (in native and learner English), as well as pouvoir,
relate to each other as well as what helps determine native speakers’ and learners’
choices. On the whole, distributionally we do find the expected groupings: the cans,
then the mays, and only then pouvoir. However, it is interesting that, semantically,
English can is more similar to French pouvoir than to English may, and the subsequent
regression results provided some initial information on why that is so. More
specifically, the way learners choose one of the two verbs is often compatible with
a processing-based account in terms of the complexity principle – they choose the
more basic and frequent can over may when the environment is complex – but is also
strongly influenced by the animacy of the subject and the semantics of the verb: can is
overpreferred by learners with animate subjects and with abstract verbs, and underpreferred
with time/place verb semantics.
With regard to the modals per se, our results confirm previous studies’ recognition
of the influential role of the linguistic context in the uses of may and can. Indeed,
while the main effects included in our final logistic regression model support studies
that have identified morphosyntactic components such as V and ST as
particularly influential categories (Leech 1969, 2004; Huddleston 2002; Collins 2009),
our results reveal the necessity to also take the semantic context of modals more seriously,
as reflected by the strong effects of VT and VS.
More generally speaking and in the parlance of the Competition Model, the cluster
analysis and the high classification accuracy of the regression suggest that, on the
whole, the learners have built up mental categories for can and may that are internally
rather coherent. However, the interactions in the regression show that these cues are
weighted incorrectly and sometimes trigger a verb choice that is not in line with native
speaker choices, but that even this kind of incorrect choice is largely predictable
because the regression can still make the correct classifications (cf. Deshors 2010 for
more detailed discussion as well as a distinctive collexeme analysis revealing additional
verb-specific preferences). In other words, even though this is the first study
involving learner data (and only the second involving different languages), the BP approach
and especially the follow-up in terms of the logistic regression are an
interesting diagnostic: (i) the overall results can testify to the strength of the categories
that are being studied, and (ii) the regression with its inclusion of the interactions of
all variables with “native speaker vs. learner” pinpoints exactly where interactions
become significant, i.e. where the categories of the learner are still substantially different
from those of the native speaker. For further applications and extensions, see Gries and
Wulff (2013) for a similar application to the choice of (of- and s-) genitives by native
speakers and learners, and Gries and Deshors (to appear) for an even more advanced
approach to precisely pinpoint where non-native speakers’ choices deviate from those
of native speakers and how much so. Needless to say, more and more rigorous testing
is necessary, but to our knowledge this is the first study proposing this kind of
approach more generally and the use of a regression with a native-learner variable
as a measure of L2 “proficiency”; the results illustrate that learners’ “non-nativeness”
manifests itself at all linguistic levels simultaneously.