LAIA CUTILLAS, LILIANA TOLCHINSKY, ELISA ROSADO, JOAN
PERERA
Indicators of lexical growth throughout age, genre
and modality for a Catalan L1 corpus
Abstract
Lexical development is a key facet of later, school-age language
development. The goal of the study is twofold: on the one hand, to
describe quantitatively the text-embedded lexicon of a corpus of texts
produced by informants from nine years of age to adulthood, and on
the other hand, to identify which characteristics of the lexicon can be
considered as indicators of age, discourse genre, and modality of
production differentiation.
The GRERLI-CAT1 (Grup de Recerca per a l’Estudi del
Repertori Lingüístic, Català L1) corpus is constituted by a set of
narrative and expository spoken and written texts in Catalan that were
produced by 79 bilingual Catalan/Spanish speakers that have Catalan as
their home language. They were distributed in five groups according to
level of education: elementary, secondary, high school and university
level, and language teachers. The corpus comprises 316 texts, which
include a total of 84,081 tokens, 40,612 types and 31,811 lemmas.
Four measures were applied for characterizing the corpus
lexically: (1) lexical diversity, (2) lexical density, (3) word length, and
(4) productivity of verbs. Lexical diversity was selected to gauge the
range of vocabulary displayed in the texts, lexical density as an
indicator of textual richness and informativeness, word length was
taken as an indicator of lexical complexity, and productivity of verbs
as an indicator of lexical richness within the verbal domain. We
discuss the utility of these four measures as indicators of text
construction development in different genres and modalities.
1. Introduction
The general framework of this study is Corpus Linguistics, which
approaches the study of language empirically. It is based on authentic
samples of language use, so data treatment is external to the speaker,
directly observable and, therefore, demonstrable. It uses computerized
corpora, which represent an important source of quantitative
information. The use of these samples makes it possible to determine
the frequency of occurrence, or the absence, of specific linguistic
elements (Civit 2003: 2).
Corpus linguistics is a method applicable to every area of linguistic
research, from phonology to discourse. Moreover, corpus-based
research is applicable to linguistic education both in first (L1) and
second (L2) language learning. Native corpora, like the one used in
this study, can be very useful for education professionals because they
show what speakers say or write in a specific communication
situation, as well as their typical difficulties (Nesselhauf 2004: 144-
145). In this study, this methodology is applied to first language
acquisition and, more specifically, to the later, school-age language
development of Catalan. Catalan is a Romance language that typically
displays a rich morphological system: nouns and adjectives are
inflected for gender and number; pronouns are inflected for gender,
number, person and case; and verbs are inflected for person/number,
tense/aspect and mood. Moreover, it requires gender and number
agreement between noun and adjective and between pronoun and its
antecedent, as well as person and number agreement between subject
and verb.1
1 Catalan is spoken in four Spanish autonomous communities (Aragon, the
Balearic Islands, Catalonia and the Valencian Community), in the French
region of Roussillon, and in the city of Alghero, on the Italian island of
Sardinia. In Catalonia, Catalan and Spanish are equally recognized as
official languages, but Catalan is the language of education. Therefore,
children educated in Catalonia are bilingual (Catalan/Spanish) and live in
a multilingual environment: “Due to a major surge of immigration over the
past decade (3% in 2000 to 13% in 2008), an increasing percentage of
children speak a language different from Catalan and Spanish at home”
(Llauradó / Tolchinsky in press).
Several Catalan corpora have been compiled in recent
years. The CCCUB (Corpus del Català Contemporani de la UB)
(Boix et al. 2002), compiled by the Grup d’Estudi de la Variació
(GEV) at the Universitat de Barcelona (UB), contains texts in
different dialects and registers, but only in one modality of production,
the spoken one. The AnCora-CA (Annotated Corpus-Catalan) (Taulé et
al. 2008) includes written journalistic texts, so only one genre and
one modality can be assessed with it. The CesCa (Català Escolar
Escrit a Catalunya) (Llauradó et al. 2012) contains different types of
written texts: narrations of a film storyline, recommendations of a
film, or definitions of words and jokes, but only in one modality of
production. These two latter corpora were compiled by the Centre de
Llenguatge i Computació (CLiC) of the UB. The CICA (Corpus
Informatitzat del Català Antic) (Torruella 2009), compiled by the
Grup de Lexicografia i Diacronia (SFI) of the Universitat Autònoma
de Barcelona (UAB), is a diachronic corpus which includes books
written from the 12th to the 15th centuries. The CTILC (Corpus
Textual Informatitzat de la Llengua Catalana) (Rafel / Solanellas
1986), compiled by the Institut d’Estudis Catalans (IEC) for
lexicographic purposes, contains both literary and non-literary texts in
different genres but, again, only in the written modality. The
GRERLI-CAT1 corpus, compiled by the GRERLI (Grup de Recerca
per a l’Estudi del Repertori Lingüístic) at the UB, includes texts in
different genres (narrative and expository) and modalities of
production (spoken and written), produced by subjects of different
school levels (elementary, secondary and high school, university level
and language teachers).
The goals of this paper are (1) to analyse quantitatively the
GRERLI-CAT1 corpus text-embedded lexicon and (2) to identify
which characteristics of the lexicon can be considered as indicators of
lexical growth throughout age, genre, and modality. To this
end, four measures are evaluated: lexical diversity, lexical
density, word length and productivity of verbs.
Lexical diversity (Malvern et al. 2004: 19) is selected because it
can gauge the range of vocabulary use. Lexical density (Read 2000:
200) is considered a good indicator of textual richness and
informativeness. Word length (Strömqvist et al. 2002: 48) is chosen
as a measure of lexical complexity. Productivity of verbs is selected as
an indicator of lexical richness within the verbal domain. This
measure embraces both the diversity of verbal lemmas, that is, the
number of different verbal lemmas used in the corpus, and the
productive use of inflectional morphology of verbs, that is, the number
of different tense forms (Serrat et al. 2004: 221).
2. Corpus compilation
The GRERLI-CAT1 corpus was compiled within the framework of a
cross-linguistic project designed to analyse the development of text
construction abilities in different languages. The corpus was compiled
in 1998 as part of the international project Developing literacy in
different contexts and in different languages, Spencer Foundation,
Chicago, United States of America (P.I.: R. A. Berman, Tel Aviv
University). The languages that took part in this study were Dutch, US
English, French, Hebrew, Icelandic, Spanish, and Swedish, with the
subsequent addition of Catalan. The main goals of this project were to
understand how speakers/writers develop their discursive abilities
throughout different educational levels, to analyse how they use the
resources of their respective languages to construct discourse in
different genres (expository and narrative) and modalities of
production (spoken and written), and, finally, to find out common and
language-specific patterns of language use when producing spoken
and written texts (Berman 2002; Aparici 2010).
Language usage, both spoken and written, is always framed
within a specific discourse genre. As Tolchinsky (2004: 235) says:
There is no such thing as neutral use of language: people constantly attune
their speech to specific intentions, purposes, and interlocutors. Thus,
development is viewed in terms of the acquisition of different discourse
genres, and the way that the cultural conventions of genres constrain the use
of linguistic forms.
The two genres selected for the project were expository and narrative.
Several reasons led to the selection of these two genres. Firstly,
the narrative genre has been extensively studied from a
developmental psycholinguistic point of view while the expository
genre has not, since it is a late-developing genre; secondly,
the two genres contrast in the personal involvement that characterises
them. A first person narrative and the discussion of a topic
illustrate the extremes of a continuum of personal involvement.
Personal involvement of the speaker/writer is demonstrated, among
other things, in the choice of the main characters. In personal
narratives the main character is the speaker/writer and, in expository
texts, the topic gains prominence (Aparici 2010).
Regarding modalities of production, spoken and written texts
differ in the demands imposed by processing.
The spoken modality is constrained by on-line processing, which can
cause mistakes, breaks, repetitions, reformulations or lack of referents.
On the other hand, the written modality is not subject to time
limitations. It enables the writer to plan thoroughly what he or she
wants to write, to outline or to proofread. For this reason, written texts
tend to be better organized and more cohesive and coherent
than spoken texts.
Informants were selected from nine years of age onwards
because at this age they have already attained a command of narrative
structure and of the mechanisms of discursive cohesion (Berman /
Slobin 1994). The sample also included adults because the
development of discursive abilities is a long process that spans from
childhood to beyond adolescence (Berman / Verhoeven 2002: 14).
2.1. Informants
The texts were produced by 79 informants from Barcelona, who were
bilingual speakers of Catalan and Spanish and whose home language
is Catalan.2 Participants were distributed in five groups, according to
their educational level and age. The elementary school group
comprises 20 students in the 4th year of elementary school. The
secondary school group comprises 19 students in the 1st year of
secondary school. The high school group comprises 20 students in the
1st year of high school. The university group comprises 10 students from
different courses and specialities, specifically, four Science students
and six Humanities students. Finally, the teachers group comprises 10
high school Catalan language teachers. This group is expected to
represent a level of professional use of the language.
Table 1 shows the data for all informant groups:
Group               n     M        Range
Elementary School   20    10;3.    9;6. - 10;9.
Secondary School    19    13;0.    12;4. - 13;4.
High School         20    17;4.    16;3. - 18;10.
University          10    21;9.    19;1. - 24;2.
Teachers            10    43;10.   37;8. - 54;11.
Table 1. Number of informants, mean age and age range.
2.2. Tasks
All informants produced four texts. After watching a three-minute
video without text, participants were asked to produce a spoken
expository text, a written expository text, a spoken narrative text and a
written narrative text. The video shows different conflict situations
in schools, such as fights, shunning classmates, cheating in exams, etc.
The purpose was for all texts produced by the informants to share a
common topic, so that their linguistic characteristics could be
compared.
2 Monolingual speakers of Catalan are unlikely to be found, since Catalan and
Spanish are both official languages in Catalonia. All children use Catalan at
school and Spanish is massively present both in the media and in social settings.
2.3. Procedure
Informants produced the four tasks individually. Data were elicited in
two sessions and four different orders of text production were
followed: A (first session: spoken narrative/written narrative; second
session: spoken expository/written expository); B (first session:
written narrative/spoken narrative; second session: written
expository/spoken expository); C (first session: spoken
expository/written expository; second session: spoken
narrative/written narrative); and D (first session: written
expository/spoken expository; second session: written
narrative/spoken narrative).
2.4. Corpus storage
We have used two different versions of the corpus: clean (net) and
morphologically tagged (morfo). In the following sub-sections, the
characteristics of these two corpus versions will be described.
2.4.1. Clean version (net)
This version consists of plain text files which contain the
productions with no additional symbols or indications of spelling
mistakes. All marking of ancillary material, such as repetitions,
reformulations, pauses and comments, is omitted. The transcription
unit is the clause. That is, each text line in this version corresponds to
a clause. We follow Berman and Slobin’s (1994: 660) definition of
this unit:
We define a clause as any unit that contains a unified predicate. By unified,
we mean a predicate that expresses a single situation (activity, event, state).
Predicates include finite and non-finite verbs, as well as predicate adjectives.
This version aims at producing an input file for the morphological
analyser HSMorfo, which performs the morphological tagging. The
counts of tokens and types were computed using the net version.
2.4.2. Morphologically-tagged version (morfo)
This is a text version that results from the net version of the corpus
after having been morphologically tagged using the HSMorfo Linux
software.3 Files of the morphologically analysed texts contain three
columns: the first one represents the type, the second one the lemma
and the third one includes the EAGLES (Expert Advisory Group on
Language Engineering Standards) tag (EAGLES 1996), which shows
the grammatical category of the type. The measures for the four
analysis dimensions of the current paper were calculated using the
morfo version.
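As an illustration only, a three-column file of this kind could be read as sketched below. The sketch assumes whitespace-separated columns, which is not stated in the description above, and `Analysis` and `parse_morfo` are hypothetical names, not part of HSMorfo.

```python
from typing import NamedTuple


class Analysis(NamedTuple):
    word_type: str  # first column: the type (word form)
    lemma: str      # second column: the lemma
    tag: str        # third column: the EAGLES tag


def parse_morfo(lines):
    """Parse 'type lemma tag' lines into Analysis records.

    Assumption: columns are separated by whitespace; lines that do not
    have exactly three columns (e.g. blank lines) are skipped.
    """
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        records.append(Analysis(*parts))
    return records


# Sample lines built from analyses reported in this chapter.
sample = [
    "punt_de_vista punt_de_vista NCMS000",
    "dels del SPCMP",
    "",
    "únic únic AQ0MS0",
]
records = parse_morfo(sample)
```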
2.4.3. Other versions
There are three more versions of the corpus: mirror (rep), normalized
(nor) and CHAT (cha). The mirror version (only for written texts)
consists of the transcription of the written productions into an MS Word
file, preserving the original layout of paragraphs and lines, punctuation
and spelling. In the normalized version, texts (both spoken and
written) are transcribed in CHAT (Codes for the Human Analysis of
Transcripts) format (MacWhinney 2012). Spoken productions are
transcribed orthographically (not phonetically) including processing
information (pauses, repetitions, reformulations, etc.). In written texts,
following CHAT conventions, spelling mistakes are followed by the
correct word (e.g. vastant [: bastant] ‘enough’). Finally, the CHAT
version only differs from the nor version in that the transcription unit
is the clause.
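The replacement notation illustrated by vastant [: bastant] can be processed mechanically. The following is a minimal sketch, assuming that each correction applies to the single word immediately before the bracket; the function names are hypothetical and the pattern does not cover the rest of the CHAT conventions.

```python
import re

# 'written form [: corrected form]', as in: vastant [: bastant]
REPLACEMENT = re.compile(r"(\S+)\s+\[:\s*([^\]]+?)\s*\]")


def extract_corrections(line):
    """Return (written form, corrected form) pairs found in a line."""
    return [(m.group(1), m.group(2)) for m in REPLACEMENT.finditer(line)]


def apply_corrections(line):
    """Replace each written form by its correction."""
    return REPLACEMENT.sub(lambda m: m.group(2), line)


line = "vastant [: bastant]"
pairs = extract_corrections(line)   # [('vastant', 'bastant')]
corrected = apply_corrections(line)  # 'bastant'
```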
3 Some files were analysed using the FreeLing software, because they
included a character which caused HSMorfo to malfunction and to stop the
analysis before reaching the end of the file. However, both programs work
in the same way: they use the same analysis process and the same EAGLES
tags. They only differ in the treatment of apostrophized words
(FreeLing attaches them to the next word and HSMorfo separates them) and of
contractions (FreeLing separates them into two words and HSMorfo treats
them as one word). These differences were subsequently resolved using
software specially designed for this purpose.
2.5. Corpus processing
The net version provides the input files for the HSMorfo Linux
software, which performs the morphological analysis and tagging and
thus produces the morfo version of the corpus. The HSMorfo software
uses the tags developed by the EAGLES group for morphosyntactic
annotation of European languages, which are adapted to Catalan in
Civit (2003). Each position of the tag corresponds to a feature (the
number of features varies for each category), and the value of each
feature is represented by a code. Zero shows that a given feature has
no value.
Below are some examples of the morphological analysis of three
specific cases of segmentation, which could be a problem for
morphological analysers:
Multi word expressions, linked by an underscore (_), such as punt
de vista ‘point of view’:
punt de vista
1. punt_de_vista punt_de_vista (‘point of view’) NCMS000
(Noun Common Masculine Singular)
Words containing a dash (-) and/or an apostrophe (’):
a) Words with clitic pronouns, which are attached to a verb
using a dash and/or an apostrophe, such as controla’ls-ho
‘control it to them’:
controla’ls-ho
1. controla controlar (‘to control’) VMM02S0 (Verb Main
Imperative 2nd Singular)
2. ’ls ell (‘he’) PP3CP000 (Personal Pronoun 3rd Common
Plural)
3. -ho ho (‘it’) PP3NN000 (Personal Pronoun 3rd Neuter
Invariable)
b) Apostrophized words, such as l’únic ‘the unique’:
l’únic
1. l’ el (‘the’) DA0CS0 (Determinate Article Common
Singular)
2. únic únic (‘unique’) AQ0MS0 (Adjective Qualifying
Masculine Singular)
Contractions (prepositions attached to an article) are treated by the
analyser as compound prepositions, and they include grammatical
information about gender and number. In other words, they
present the preposition features (the first three values of the tag)
and the article features (the last two values of the tag), such as
dels ‘of the’:
dels
1. dels del (‘of the’) SPCMP (Adposition Preposition
Complex Masculine Plural)
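The positional reading of the tags can be sketched as follows for nouns. This is an illustration, not the analyser's own code: the table covers only the first four positions of a noun tag, and the codes for Proper and Feminine are assumed from the general EAGLES scheme rather than taken from the examples above.

```python
# One entry per tag position: (feature name, code -> value).
# Limited to nouns; "P" (Proper) and "F" (Feminine) are assumptions.
NOUN_POSITIONS = [
    ("category", {"N": "Noun"}),
    ("type", {"C": "Common", "P": "Proper"}),
    ("gender", {"M": "Masculine", "F": "Feminine", "C": "Common"}),
    ("number", {"S": "Singular", "P": "Plural", "N": "Invariable"}),
]


def decode_noun_tag(tag):
    """Decode the first four positions of a noun tag such as NCMS000."""
    decoded = {}
    for (feature, codes), code in zip(NOUN_POSITIONS, tag):
        if code == "0":
            continue  # zero: the feature has no value
        # Unknown codes are kept as they are rather than guessed.
        decoded[feature] = codes.get(code, code)
    return decoded


decoded = decode_noun_tag("NCMS000")
```

For punt_de_vista, the result lists the category Noun, the type Common, the gender Masculine and the number Singular, which matches the gloss given above.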
As we have seen in this section, the characteristics of the GRERLI-
CAT1 corpus are as follows: it contains 316 texts, which include a
total of 84,081 tokens, 40,612 types and 31,811 lemmas. The main
features that identify this corpus are: (i) Language: the corpus contains
texts in Catalan; (ii) Developmental dimension: the corpus reflects
later, school-age language development, because it contains data
produced by informants from nine years old to adulthood; (iii) Genre:
it contains texts in two discourse genres: narrative and expository; (iv)
Modality: it includes texts of two modalities of production: spoken
and written; and finally, (v) Corpus storage: the corpus is stored in
five different versions that can serve as input for different
computational platforms. Moreover, because the corpus is part of a
cross-linguistic project, in which the same data elicitation procedure
was used for the different languages, it can be compared with the
corpora in the other languages that took part in the study.
3. Dimensions of analysis
In order to identify the characteristics of the lexicon that can be
considered as indicators of age, discourse genre and modality of
production differentiation, the selected measures are: (3.1.) lexical
diversity, calculated using type-token, lemma-token and lemma-type
ratios; (3.2.) lexical density, i.e. the proportion of content words
relative to the total number of words; (3.3.) word length, i.e. the
number of letters of each lexical word; and (3.4.) productivity of
verbs, characterised by the diversity of verbal lemmas used and the
productive use of inflectional morphology of verbs, i.e. the number of
verb tenses associated with each verbal lemma.
3.1. Lexical diversity
Lexical diversity (Malvern et al. 2004) is a widely used measure of
language development, because it can gauge the range of vocabulary
use. Although the type-token ratio is by far the most commonly used
measure of lexical diversity, it is much less useful for measuring
the vocabulary of inflectional languages like Catalan than another
measure, the lemma-token ratio, which provides more accurate
information about lexical diversity. As Granger / Wynne (2000: 251)
put it:
A learner who uses five different forms of the verb go (go/goes/going/
gone/went) in one and the same text has a less varied vocabulary than the one
who uses five different lemmas (such as go/come/leave/enter/return).
Moreover, a parameter that relates lemmas to types, the lemma-type
ratio, can also be useful to study the lexical diversity of Catalan.
The type-token ratio is obtained by dividing the number of types by
the number of tokens; the lemma-token ratio by dividing the number of
lemmas by the number of tokens; and finally, the lemma-type ratio by
dividing the number of lemmas by the number of types.
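The three ratios can be illustrated with the two learners of the Granger / Wynne quotation. The sketch below assumes that each token is available as a (word form, lemma) pair; the function name is hypothetical and no case folding or punctuation filtering is applied.

```python
def diversity_ratios(analysed_tokens):
    """Compute the three ratios from (word form, lemma) pairs, one per token."""
    n_tokens = len(analysed_tokens)
    n_types = len({form for form, _ in analysed_tokens})
    n_lemmas = len({lemma for _, lemma in analysed_tokens})
    return {
        "type_token": n_types / n_tokens,
        "lemma_token": n_lemmas / n_tokens,
        "lemma_type": n_lemmas / n_types,
    }


# Five forms of one lemma versus five different lemmas.
learner_a = [(form, "go") for form in ("go", "goes", "going", "gone", "went")]
learner_b = [(word, word) for word in ("go", "come", "leave", "enter", "return")]

ratios_a = diversity_ratios(learner_a)
ratios_b = diversity_ratios(learner_b)
```

Both learners obtain a type-token ratio of 1.0, but the lemma-token ratio is 0.2 for the first and 1.0 for the second, which is the difference that the type-token ratio alone cannot capture.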
3.2. Lexical density
Lexical density (Read 2000) is considered to be a good indicator of
textual richness. If content words convey the vast bulk of semantic
content, then the relative proportion of lexical words used in a text can
provide an idea of its informativeness. Lexical density is obtained by
dividing the number of lexical words (nouns, main verbs, adjectives
and adverbs) by the total of words in a sample.
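A minimal sketch of this computation over EAGLES tags is given below. It assumes that lexical words can be identified by the tag prefixes N (noun), VM (main verb), A (adjective) and R (adverb); the adverb code is taken from the general EAGLES scheme, and the treatment of punctuation is left open.

```python
# Tag prefixes treated as lexical words (assumption, see lead-in).
LEXICAL_TAG_PREFIXES = ("N", "VM", "A", "R")


def lexical_density(tags):
    """Proportion of tags that belong to a lexical category."""
    if not tags:
        raise ValueError("empty sample")
    n_lexical = sum(1 for tag in tags if tag.startswith(LEXICAL_TAG_PREFIXES))
    return n_lexical / len(tags)


# Article, adjective, noun, contraction: two lexical words out of four.
tags = ["DA0CS0", "AQ0MS0", "NCMS000", "SPCMP"]
density = lexical_density(tags)  # 0.5
```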
3.3. Word length
Word length (Strömqvist et al. 2002) is regarded as an indicator of
lexical complexity. Longer words are assumed to be
derivationally more complex, so word length is a suitable
measure of lexical complexity. Function words (auxiliary verbs,
determiners, pronouns, prepositions, conjunctions and interjections)
have been removed from this count because of their grammatical
status. Mean word length is obtained by dividing the number of
letters in the lexical words by the total number of words.
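A minimal sketch of the mean word length computation follows. It assumes that the denominator is the number of lexical words, since function words are excluded from the count, and that only alphabetic characters count as letters (so apostrophes, dashes and underscores are ignored); the function name is hypothetical.

```python
def mean_word_length(lexical_words):
    """Mean number of letters per lexical word."""
    if not lexical_words:
        raise ValueError("empty sample")
    # Count alphabetic characters only; accented letters count as one letter.
    lengths = [sum(ch.isalpha() for ch in word) for word in lexical_words]
    return sum(lengths) / len(lengths)


words = ["únic", "controla", "bastant"]
mean_length = mean_word_length(words)  # (4 + 8 + 7) / 3
```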
3.4. Productivity of verbs
Productivity of verbs is seen here as an indicator of lexical richness
within the verbal domain. In the context of the present study we
consider the diversity of verbal lemmas (number of different lexical
verbs that appear in the corpus) as well as the productive use of
inflectional verb morphology (number of different verb forms,
marked for tense/aspect or mood, for each specific verbal lemma).
Diversity of verbal lemmas is obtained by dividing the total of verbal
lemmas by the number of different verbal lemmas used in a sample.
Productive use of inflectional morphology (Serrat et al. 2004) is
calculated by counting the number of different tenses used for each
verbal lemma. For these counts, differences in person/number
inflection have not been considered, and the non-finite forms of the
verb (infinitive, participle and gerund) have only been considered as
part of a compound verb, that is, of tenses formed by an auxiliary verb
(haver ‘to have’ or anar ‘to go’, and ser ‘to be’, for passive forms)
plus a non-finite form of the main verb.
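Both counts can be sketched over (lemma, EAGLES tag) pairs. The sketch assumes that in verb tags the third position encodes mood and the fourth tense, with N, G and P marking infinitive, gerund and participle; these codes come from the general EAGLES scheme. As a simplification, non-finite forms are never counted as tenses, so compound tenses are not reconstructed here.

```python
from collections import defaultdict

# Mood codes treated as non-finite (assumed EAGLES codes).
NON_FINITE_MOODS = {"N", "G", "P"}


def verb_productivity(analyses):
    """Return (diversity of verbal lemmas, number of tenses per lemma).

    analyses: iterable of (lemma, EAGLES tag) pairs for one sample.
    Person and number (later tag positions) are ignored.
    """
    tenses = defaultdict(set)
    n_verb_tokens = 0
    for lemma, tag in analyses:
        if not tag.startswith("VM"):  # main verbs only
            continue
        n_verb_tokens += 1
        forms = tenses[lemma]  # registers lemmas seen only in non-finite forms
        mood, tense = tag[2], tag[3]
        if mood not in NON_FINITE_MOODS:
            forms.add((mood, tense))
    if not tenses:
        raise ValueError("no main verbs in sample")
    # Total of verbal lemmas divided by the number of different verbal lemmas.
    diversity = n_verb_tokens / len(tenses)
    tenses_per_lemma = {lemma: len(forms) for lemma, forms in tenses.items()}
    return diversity, tenses_per_lemma


# Hypothetical sample: five main-verb tokens and one article.
sample = [
    ("controlar", "VMM02S0"),
    ("controlar", "VMIP3S0"),
    ("controlar", "VMIP1S0"),
    ("dir", "VMII3S0"),
    ("dir", "VMN0000"),
    ("el", "DA0CS0"),
]
diversity, tenses_per_lemma = verb_productivity(sample)
```

In this sample the two present indicative forms of controlar count as one tense because person and number are ignored, and the infinitive of dir is not counted.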
4. Results
4.1. Quantitative description of the corpus
Table 2 presents the distribution of tokens, types and lemmas by text
type. There are differences both between the two discourse
genres and between the two modalities of production. Expository texts
are significantly longer than narrative texts, both spoken and written:
(F (1, 74) = 19.024, p = .000) for tokens, (F (1, 74) = 22.661, p = .000)
for types, and (F (1, 74) = 24.078, p = .000) for lemmas. Spoken
texts in both genres have significantly more tokens, types and lemmas
than written texts: (F (1, 74) = 25.669, p = .000) for tokens, (F (1, 74)
= 12.080, p = .001) for types, and (F (1, 74) = 12.556, p = .001) for
lemmas. By text type, spoken expository texts have more tokens than
spoken narrative texts, and written expository texts more than written
narrative ones. Types and lemmas present a different pattern:
expository texts, both spoken and written, have more types and lemmas
than narrative ones, so that written expository texts outrank spoken
narrative texts (see Table 2). A significant interaction between genre
and modality is also found for tokens (F (1, 74) = 11.423, p = .001),
types (F (1, 74) = 11.113, p = .001) and lemmas (F (1, 74) = 8.451,
p = .005).
Table 3 presents the distribution of tokens, types and lemmas by
age group. There is a significant effect of age: (F (4, 74) = 13.910, p =
.000) for tokens, (F (4, 74) = 24.176, p = .000) for types, and (F (4, 74)
= 24.047, p = .000) for lemmas. The high school group has the
highest number of tokens, types and lemmas, followed by the older
groups, university and teachers. The next group is secondary school,
and elementary school is the age group with the lowest number of
tokens, types and lemmas. So the counts of tokens, types and lemmas
increase gradually from elementary to high school, and then
decrease progressively from high school to university and from
university to teachers. However, the groups differ in size (see Table
1), and the means of tokens, types and lemmas show a different
pattern: there is an increase from elementary school to university,
but then the means of tokens, types and lemmas
decrease in the teachers’ group. An interaction is found between genre
and age for tokens (F (4, 74) = 4.660, p = .002), types (F (4, 74) =
3.748, p = .008) and lemmas (F (4, 74) = 3.732, p = .008). The
number of tokens also shows an interaction between modality and age
(F (4, 74) = 2.574, p = .045).
                             Counts        M       SD
Spoken Expository   Tokens   30,315   383.73   406.69
                    Types    12,269   155.30   107.77
                    Lemmas    9,479   119.99    79.64
Written Expository  Tokens   16,929   214.29   143.39
                    Types     9,828   124.41    65.19
                    Lemmas    7,791    98.62    51.38
Spoken Narrative    Tokens   22,151   280.39   275.56
                    Types     9,730   123.16    77.97
                    Lemmas    7,658    96.94    59.63
Written Narrative   Tokens   14,686   185.90   104.56
                    Types     8,785   111.20    53.98
                    Lemmas    6,883    87.13    41.98
Table 2. Tokens, types and lemmas distribution by text type.
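As a consistency check on Table 2, the counts per text type can be summed and compared with the corpus totals given in the text (84,081 tokens, 40,612 types and 31,811 lemmas). The sketch below also assumes that each mean is the count divided by the 79 texts of each type, which is an inference from the figures rather than something stated in the chapter.

```python
N_TEXTS_PER_TYPE = 79  # one text of each type per informant (assumption)

# Counts transcribed from Table 2.
TABLE2_COUNTS = {
    "spoken expository": {"tokens": 30315, "types": 12269, "lemmas": 9479},
    "written expository": {"tokens": 16929, "types": 9828, "lemmas": 7791},
    "spoken narrative": {"tokens": 22151, "types": 9730, "lemmas": 7658},
    "written narrative": {"tokens": 14686, "types": 8785, "lemmas": 6883},
}

# Sum each unit over the four text types.
totals = {
    unit: sum(row[unit] for row in TABLE2_COUNTS.values())
    for unit in ("tokens", "types", "lemmas")
}

# Mean number of tokens per spoken expository text.
mean_tokens_spoken_expository = round(
    TABLE2_COUNTS["spoken expository"]["tokens"] / N_TEXTS_PER_TYPE, 2
)
```

The sums reproduce the corpus totals, and the computed mean agrees with the value of 383.73 reported in the table.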
                             Counts        M       SD
Elementary School   Tokens   11,223   140.28    57.63
                    Types     6,356    79.45    25.62
                    Lemmas    5,013    62.66    18.63
Secondary School    Tokens   12,200   160.52    71.70
                    Types     6,592    86.73    28.19
                    Lemmas    5,137    67.59    19.55
High School         Tokens   26,082   326.02   192.00
                    Types    12,059   150.73    56.26
                    Lemmas    9,252   115.65    41.66
University          Tokens   19,627   490.67   282.00
                    Types     8,241   206.02    69.98
                    Lemmas    6,460   161.50    53.38
Teachers            Tokens   14,949   373.72    75.37
                    Types     7,364   184.10    30.50
                    Lemmas    5,949   148.72    25.90
Table 3. Tokens, types and lemmas distribution by age.
4.2. Lexical diversity
Figure 1 illustrates the results by text type of the three measures
selected for characterising lexical diversity: the type-token,
lemma-token and lemma-type ratios. Firstly, the results of the
type-token ratio are presented. Regarding the differences between the
two discourse genres, no significant effect of genre is found. As for
modality, the type-token ratio is significantly higher in the written
than in the spoken texts (F (1, 74) = 175.278, p = .000). By text type,
written texts, both expository and narrative, have the highest
type-token ratio, followed by spoken narrative and spoken expository.
There is a marginally significant interaction between genre and
modality (F (1, 74) = 3.568, p = .063). Secondly, the lemma-token ratio
results show that, as with the type-token ratio, genre has no
significant effect. The lemma-token ratio is significantly higher in
the written than in the spoken modality (F (1, 74) = 100.689, p = .000).
By text type, the lemma-token ratio is higher in spoken narrative texts
than in spoken expository ones, but lower in written narrative texts
than in written expository ones. This ratio shows an interaction
between discourse genre and modality of production (F (1, 74) = 4.349,
p = .040). Finally, the results for the lemma-type ratio are presented.
This ratio is not significantly affected by genre or modality. Spoken
narrative texts have the highest lemma-type ratio, followed by written
expository, spoken expository and written narrative. A marginally
significant interaction between genre and modality is found
(F (1, 74) = 3.529, p = .064).
Figure 2 shows type-token, lemma-token and lemma-type ratios
by age group. The type-token ratio is significantly affected by age (F
(4, 74) = 10.423, p = .000): it decreases from elementary school to
university but increases in the teachers’ group. This result indicates
a higher lexical diversity in the oldest age group. Interactions are
found both between genre and age (F (4, 74) = 5.349, p = .001) and
between modality and age (F (4, 74) = 4.608, p = .002). The
lemma-token ratio follows a similar trend: there is a significant
effect of age (F (4, 74) = 7.310, p = .000), and there are interactions
between genre and age (F (4, 74) = 4.070, p = .005) and between
modality and age (F (4, 74) = 4.422, p = .003). Finally, the
lemma-type ratio only shows a marginally significant effect of age
(F (4, 74) = 2.252, p = .071).
Figure 1. Type-token-lemma ratios by text type.
Figure 2. Type-token-lemma ratios by age group.
4.3. Lexical density
Table 4 presents lexical density by text type; roughly half of
the tokens used in each text type are lexical tokens. Expository texts
have significantly higher lexical density than narrative ones (F (1, 74)
= 30.151, p = .000). No significant effect of modality is found. By text
type, lexical density is higher in spoken expository texts than in
spoken narrative ones, but higher in written narrative texts than in
written expository ones. There is a significant interaction between
genre and modality (F (1, 74) = 4.524, p = .037).
Table 5 presents the results by age group. Lexical density is
significantly affected by age (F (4, 74) = 10.193, p = .000). No
interactions between genre and age or between modality and age are
found. The counts of lexical density show a developmental pattern,
except for the secondary school group. However, since the mean lexical
density of each age group increases gradually, we ran Bonferroni
post-hoc comparisons in order to determine between which age groups
significant differences are found (d denotes the difference between
the two group means). Regarding discourse genres, significant
differences in expository texts are found between the youngest and
the oldest groups, that is, between elementary school (M = .484, SD =
.042) and teachers (M
= .527, SD = .032) (d = .043); and also between secondary school (M
= .470, SD = .036) and both university (M = .509, SD = .020) (d =
.039) and teachers (d = .057). For narrative texts, Bonferroni post-hoc
analyses show that significant differences are to be found between
elementary school (M = .431, SD = .044) and the three oldest groups,
that is, high school (M = .469, SD = .029) (d = .038), university (M =
.493, SD = .018) (d = .061) and teachers (M = .488, SD = .021) (d =
.056); and also between secondary school (M = .446, SD = .048) and
both university group (d = .046) and teachers (d = .041). As for
modality, significant differences in spoken texts are found between the
youngest and the oldest groups, that is, between elementary school (M
= .460, SD = .028) and both university group (M = .506, SD = .021) (d
= .045) and teachers (M = .505, SD = .025) (d = .044); and also
between secondary school (M = .456, SD = .038) and both university
group (d = .049) and teachers (d = .048). Finally, for written texts,
significant differences are found between elementary school (M =
176 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera
.455, SD = .040) and the two oldest groups, that is, university group
(M = .496, SD = .021) (d = .041) and teachers (M = .509, SD = .025)
(d = .054); and also between secondary school (M = .459, SD = .035)
and teachers (d = .049).
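The pairwise comparisons reported above can be illustrated with a short Python sketch. It rests on two assumptions that the text does not state explicitly: that d denotes the difference between the two group means (which agrees with the expository figures given above), and that the Bonferroni correction starts from a nominal alpha of .05. The dictionary and function names are hypothetical.

```python
from itertools import combinations

# Mean lexical density in expository texts, as reported in the text above.
expository_means = {
    "elementary": 0.484,
    "secondary": 0.470,
    "university": 0.509,
    "teachers": 0.527,
}


def mean_difference(means: dict, younger: str, older: str) -> float:
    """Difference between two group means, rounded to three decimals."""
    return round(means[older] - means[younger], 3)


# Bonferroni correction: the nominal alpha is divided by the number of
# pairwise comparisons among the five age groups of the corpus.
N_GROUPS = 5
NOMINAL_ALPHA = 0.05  # assumption: not stated in the text
n_comparisons = len(list(combinations(range(N_GROUPS), 2)))
alpha_per_comparison = NOMINAL_ALPHA / n_comparisons

for younger, older in [("elementary", "teachers"),
                       ("secondary", "university"),
                       ("secondary", "teachers")]:
    d = mean_difference(expository_means, younger, older)
    print(f"{younger} vs {older}: d = {d:.3f}")
print(f"{n_comparisons} comparisons, alpha per comparison = {alpha_per_comparison}")
```

For some of the narrative, spoken and written comparisons the reported d differs from the difference of the rounded means by .001, presumably because the original analysis worked with unrounded means.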
Text type           Measure            Counts        M       SD
Spoken Expository   Tokens             30,315   383.73   406.69
                    Lexical tokens     15,071   190.92   210.85
                    Lexical density      .497     .487     .040
Written Expository  Tokens             16,929   214.29   143.39
                    Lexical tokens      6,960   107.06    73.19
                    Lexical density      .411     .495     .044
Spoken Narrative    Tokens             22,151   280.39   275.56
                    Lexical tokens     10,527   133.53   138.20
                    Lexical density      .475     .463     .042
Written Narrative   Tokens             14,686   185.90   104.56
                    Lexical tokens      6,897    86.25    52.78
                    Lexical density      .469     .455     .053
Table 4. Lexical density by text type.
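The relation between the counts in Table 4 can be sketched in Python as follows. The sketch assumes that lexical density is the quotient of lexical tokens over total tokens; the values in the Counts column appear to correspond to this quotient truncated (not rounded) to three decimals. The variable and function names are hypothetical.

```python
import math

# (tokens, lexical tokens) per text type, copied from Table 4.
table4_counts = {
    "spoken expository": (30315, 15071),
    "written expository": (16929, 6960),
    "spoken narrative": (22151, 10527),
    "written narrative": (14686, 6897),
}


def lexical_density(tokens: int, lexical_tokens: int) -> float:
    """Proportion of lexical (content) tokens among all tokens."""
    if tokens <= 0:
        raise ValueError("token count must be positive")
    return lexical_tokens / tokens


def truncate(value: float, places: int = 3) -> float:
    """Cut a non-negative value off after `places` decimals, without rounding."""
    factor = 10 ** places
    return math.floor(value * factor) / factor


densities = {
    text_type: truncate(lexical_density(tokens, lexical))
    for text_type, (tokens, lexical) in table4_counts.items()
}
for text_type, density in densities.items():
    print(f"{text_type}: {density:.3f}")
```

The M column of the table is a different quantity: it averages the density of individual texts, so it need not coincide with the density computed from pooled counts.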
Age group          Measure            Counts        M       SD
Elementary School  Tokens             11,223   140.28    57.63
                   Lexical tokens      5,182    64.23    27.36
                   Lexical density      .461     .457     .031
Secondary School   Tokens             12,200   160.52    71.70
                   Lexical tokens      4,122    73.96    34.42
                   Lexical density      .337     .458     .031
High School        Tokens             26,082   326.02   192.00
                   Lexical tokens     12,627   157.72    97.27
                   Lexical density      .484     .481     .022
University         Tokens             19,627   490.67   282.00
                   Lexical tokens      9,902   247.32   142.20
                   Lexical density      .504     .501     .016
Teachers           Tokens             14,949   373.72    75.37
                   Lexical tokens      7,590   190.82    41.95
                   Lexical density      .507     .507     .021
Table 5. Lexical density by age group.
4.4. Word length
Figure 3 shows the results of word length by text type. Word length
differs significantly between the genres and modalities studied.
Expository texts have longer words than narrative texts, with a
significant effect of genre (F (1, 74) = 19.614, p = .000). Written
texts have longer words than spoken texts, with a significant effect of
modality (F (1, 74) = 68.432, p = .000). By text type, written
expository texts have the longest words, followed by spoken expository,
written narrative and spoken narrative texts; there is a significant
interaction between genre and modality (F (1, 74) = 4.478, p = .038).
Figure 4 illustrates the word length results by age group. Word
length is significantly affected by age (F (4, 74) = 3.728, p = .000).
There is also a significant interaction between genre and age (F (4, 74)
= 4.707, p = .002) as well as between modality and age (F (4, 74) =
7.633, p = .000). This measure shows a clear developmental pattern:
it increases with the subjects’ age, with the largest increase
between the university and teachers groups.
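A word-length measure of the kind analysed above could be computed along the following lines. This Python sketch assumes that length is counted in letters per word and averaged over all words of a text; the passage above does not spell out the exact operationalization (for instance, whether only lexical words are counted), so the functions and the toy sentence are hypothetical.

```python
from typing import Iterable


def letter_count(word: str) -> int:
    """Count alphabetic characters only, so that the middle dot in
    'col·legi' or an apostrophe does not add to the length."""
    return sum(1 for ch in word if ch.isalpha())


def mean_word_length(words: Iterable[str]) -> float:
    """Average number of letters per word, ignoring items without letters."""
    lengths = [n for n in map(letter_count, words) if n > 0]
    if not lengths:
        raise ValueError("no words containing letters")
    return sum(lengths) / len(lengths)


# Hypothetical toy text: 2 + 9 + 7 letters over three words.
toy_words = ["el", "conflicte", "escolar"]
print(mean_word_length(toy_words))
```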
Figure 3. Word length by text type.
Figure 4. Word length by age group.
4.5. Productivity of verbs
4.5.1. Diversity of verbal lemmas
Table 6 shows the diversity of verbal lemmas by text type. No
significant effect of genre is found. Written texts have higher diversity
of verbal lemmas than spoken ones (F (1, 74) = 114.586, p = .000). By
text type, written expository is the text type with the highest diversity
of verbal lemmas, .658, then spoken expository, with a diversity of
.516, closely followed by written narrative, .508, and finally, spoken
narrative is the text type with the lowest diversity of verbal lemmas,
.463. Table 7 shows the diversity of verbal lemmas by age group.
There is a significant effect of age (F (4, 74) = 4.117, p = .005). The
results show that diversity increases gradually, except for the
university group. There is a significant interaction between genre and
age (F (4, 74) = 3.659, p = .009) as well as between modality and age
(F (4, 74) = 5.549, p = .001).
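To make the measure concrete, the sketch below computes diversity of verbal lemmas as the number of different verbal lemmas divided by the total number of verbal lemma occurrences. This definition is an assumption, although it is consistent with the counts reported for spoken expository texts (328 / 635 ≈ .516). The function name and the toy list are hypothetical.

```python
from collections import Counter
from typing import Iterable


def verbal_lemma_diversity(verb_lemmas: Iterable[str]) -> float:
    """Different verbal lemmas divided by verbal lemma occurrences."""
    counts = Counter(verb_lemmas)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verbal lemmas in the text")
    return len(counts) / total


# Hypothetical toy text: 8 verb occurrences, 5 different lemmas.
toy_verbs = ["dir", "fer", "dir", "anar", "dir", "fer", "ser", "tenir"]
print(verbal_lemma_diversity(toy_verbs))
```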
Text type           Measure                  Counts       M      SD
Spoken Expository   Verbal lemmas               635   77.61   83.78
                    Different verbal lemmas     328   28.63   18.64
                    Diversity                  .516    .471    .157
Written Expository  Verbal lemmas               407   40.59   25.77
                    Different verbal lemmas     268   22.76   12.24
                    Diversity                  .658    .597    .116
Spoken Narrative    Verbal lemmas               621   61.09   53.90
                    Different verbal lemmas     288   24.27   14.65
                    Diversity                  .463    .454    .125
Written Narrative   Verbal lemmas               584   39.92   20.64
                    Different verbal lemmas     297   22.16   10.71
                    Diversity                  .508    .575    .115
Table 6. Diversity of verbal lemmas by text type.
Age group          Measure                  Counts       M      SD
Elementary School  Verbal lemmas               323   33.02   14.57
                   Different verbal lemmas     154   16.76    5.87
                   Diversity                  .476    .573    .096
Secondary School   Verbal lemmas               301   35.14   16.81
                   Different verbal lemmas     151   17.23    6.41
                   Diversity                  .501    .555    .089
High School        Verbal lemmas               640   68.35   37.14
                   Different verbal lemmas     327   27.91    8.67
                   Diversity                  .510    .529    .082
University         Verbal lemmas               597   96.12   54.21
                   Different verbal lemmas     285   37.92   14.13
                   Diversity                  .477    .529    .035
Teachers           Verbal lemmas               546   67.30   11.05
                   Different verbal lemmas     293   33.17    6.10
                   Diversity                  .536    .605    .033
Table 7. Diversity of verbal lemmas by age group.
4.5.2. Productive use of inflectional morphology of verbs
The productive use of inflectional morphology by text type presents
no significant differences between discourse genres and modalities of
production. Verbal lemmas with the least productive use of
inflectional morphology, that is, verbs which appear in only one tense,
represent more than half of the total in all text types. Spoken
narrative is the text type which shows the most productive use of
inflectional morphology: verbs like anar ‘to go’ and dir ‘to say’ are
used in 11 different tenses. Written expository is the text type which
shows the least productive use of inflectional morphology: verbs like
fer ‘to do’ and veure ‘to see’ are used in seven different tenses.
The productive use of inflectional morphology by age group
presents no significant differences. More than half of the verbal
lemmas have the least productive use of inflectional morphology, that
is, they appear in only one tense. Elementary school is the age group
that shows the least productive use of inflectional morphology; for
example, the verb dir ‘to say’ is used in eight different tenses. The
secondary school group has a similar productive use of inflectional
morphology; for instance, the verb tenir ‘to have got’ is used in nine
tenses. The three oldest age groups have a more productive use of
inflectional morphology; for example, the verb dir ‘to say’ is used in
11 tenses in all three groups.
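The counting behind this measure can be sketched in Python as follows, assuming that each verb token has been annotated with its lemma and tense. The tense labels, the toy data and the function names are hypothetical; the sketch only illustrates counting distinct tenses per lemma and the share of lemmas attested in a single tense.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def tenses_per_lemma(verb_tokens: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Number of different tenses in which each verbal lemma occurs."""
    tenses = defaultdict(set)
    for lemma, tense in verb_tokens:
        tenses[lemma].add(tense)
    return {lemma: len(found) for lemma, found in tenses.items()}


def share_single_tense(counts: Dict[str, int]) -> float:
    """Proportion of verbal lemmas that appear in only one tense."""
    if not counts:
        raise ValueError("no verbal lemmas")
    return sum(1 for n in counts.values() if n == 1) / len(counts)


# Hypothetical annotated verb tokens: (lemma, tense label).
toy_tokens = [
    ("dir", "present"), ("dir", "imperfect"), ("dir", "future"),
    ("dir", "present"), ("fer", "present"), ("traumatitzar", "conditional"),
]
counts = tenses_per_lemma(toy_tokens)
print(counts)
print(share_single_tense(counts))
```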
Verbal lemmas with a more productive use seem to be
high-frequency verbs (e.g. dir ‘to say’, fer ‘to do’, tenir ‘to have got’).
By contrast, verbal lemmas with a less productive use seem to be
low-frequency verbs (e.g. traumatitzar ‘to traumatize’, senyorejar ‘to
dominate’, violentar ‘to embarrass’). In order to compare the most
frequent verbal lemmas of the GRERLI-CAT1 corpus with the
frequency of the same verbal lemmas in the CTILC corpus, we consulted
the DdF (Diccionari de Freqüències, ‘Frequency List’) (Rafel 1998) of
the Institut d’Estudis Catalans, which is based on the CTILC (Corpus
Textual Informatitzat de la Llengua Catalana) and includes literary
and non-literary written texts in Catalan. Table 8 shows, in the first
column, the 10 most frequent verbal lemmas of the GRERLI-CAT1 corpus
(from highest to lowest); the second column gives their absolute
frequency, that is, the number of occurrences of tokens of each verbal
lemma out of the total number of tokens of the GRERLI-CAT1 corpus
(84,081). The third column shows their relative frequency, that is,
the percentage that the lemma represents of the total tokens of the
corpus. In the fourth column, the same verbal lemmas are presented
from highest to lowest frequency according to the DdF; the fifth
column gives their absolute frequency out of the total number of
tokens of the CTILC corpus (52,375,044); and, finally, the sixth
column shows their relative frequency. As we can see, the results are
quite similar, if we take into account the huge difference between the
number of tokens of the two corpora. This confirms that the verbs with
the most productive use are also the most frequently used.
Verbal lemma           Absolute   Relative   Verbal lemma           Absolute   Relative
(GRERLI-CAT1)         frequency  frequency   (CTILC)               frequency  frequency
dir ‘to say’                 48      3.967   ser ‘to be’           1,000,352      1.951
fer ‘to do’                  44      3.636   fer ‘to do’             338,070       .659
ser ‘to be’                  37      3.058   tenir ‘to have got’     236,795       .462
anar ‘to go’                 31      2.562   dir ‘to say’            222,380       .433
haver ‘to have’              26      2.149   poder ‘can’             211,688       .413
passar ‘to happen’           23      1.901   anar ‘to go’            114,340       .223
poder ‘can’                  23      1.901   estar ‘to be’           111,112       .216
tenir ‘to have got’          17      1.405   passar ‘to happen’       64,499       .125
estar ‘to be’                14      1.157   començar ‘to start’      31,259       .060
començar ‘to start’           8       .661   haver ‘to have’           1,984       .003
Table 8. Comparison of the 10 most frequently used verbal lemmas of the
GRERLI-CAT1 corpus with the results of the DdF.
5. Conclusion
Our study enables us to distinguish the specific contribution of four
lexicon-related measures for the corpus-based study of lexical
development. The results show that word length offers the best
diagnosis of lexical development, genre and modality differentiation.
Lexical density is a good indicator of developmental changes and
genre differentiation, but not of modality differentiation. Lexical
diversity, by contrast, is a good indicator of developmental changes
and modality differences, but not of genre differentiation. Similarly,
diversity of verbal lemmas is also an indicator of developmental
changes and modality differences. Finally, productive use of
inflectional morphology is not appropriate for characterizing later,
school-age lexical development.
Word length appears to be a valid measure for differentiating between
school levels, discourse genres and modalities of production. There are
significant differences between age groups, effects of genre and
modality, and interactions between genre and age, between modality and
age, and between genre and modality. Word length shows a clear
developmental pattern from childhood to adulthood. Moreover, word
length is significantly higher in the expository genre than in the
narrative one, and in the written than in the spoken modality. In line
with other studies, such as Strömqvist et al. (2002) for Swedish or
Llauradó / Tolchinsky (in press) for Catalan, word length appears to be
the best diagnostic of development, genre and modality differences.
Lexical density is a valid measure to differentiate between
school level and discourse genres. Significant differences between age
groups and an effect of genre are found, and there is also an
interaction between genre and modality. Lexical density yielded no
clear developmental pattern, like the ones found in previous studies
(Llauradó / Tolchinsky in press), though we found differences
between the youngest and the oldest groups. As for differences by
genre, expository texts are denser than narrative ones. However, and
unlike other studies in which lexical density showed significant
differences between written and spoken texts (Strömqvist et al. 2002),
we found no significant effect of modality of production.
Lexical diversity, measured by type-token, lemma-token and
lemma-type ratios, is a valid measure for revealing developmental
changes and modality differences. However, lexical diversity does not
account for genre differentiation. In both type-token and lemma-token
ratios, there is a significant effect of age and modality, and
interactions between genre and age, between modality and age, and
between genre and modality. Regarding the lemma-type ratio, no
significant differences between genres, modalities and age groups are
found. The highest lexical diversity is found in the oldest age group, in
line with the findings of other studies (Berman / Verhoeven 2002;
Strömqvist et al. 2002). Written texts have higher type-token and
lemma-token ratios than spoken texts. This finding also corroborates
those reported for the type-token ratio in Swedish by Strömqvist et al.
(2002).
Finally, productivity of verbs serves our purposes only
partially. As for diversity of verbal lemmas, there is a significant
effect of age and modality of production, and interactions between
genre and age and between modality and age are also observed.
Diversity of verbal lemmas increases gradually through age groups,
except for the university group. Written texts have higher diversity
than spoken ones. As for the productive use of inflectional
morphology of verbs, it does not seem to be a valid measure for
differentiating between school levels, discourse genres and modalities
of production. Nine-year-olds and adults display a similar use of
inflectional morphology of verbs.
In sum, for our purposes, the most suitable measure is word
length, because it can characterise differences in all variables: age,
genre and modality. Lexical density is appropriate to differentiate
between ages and discourse genres, and lexical diversity proves to be a
useful measure to differentiate between ages and modalities of
production. Finally, diversity of verbs can serve to differentiate
between ages and modalities, but productive use of inflectional
morphology of verbs is not a valid indicator of text construction
development in different genres and modalities.
The characteristics of the GRERLI-CAT1 corpus will allow us
to explore the type of lexicon used by different age groups, as well as
the syntactic organization of different types of texts, the textual
components of each discourse genre or the effect of the order of text
production (Cutillas in press), among others. Cross-linguistic
comparisons with the other languages of the main project will enable
us to analyse the similarities and differences between these languages,
as well. Moreover, comparisons with the same corpus for Catalan as
L2 (GRERLI-CAT2) will provide us with valid data to characterise
Catalan language development (L1) and language learning (L2).
Acknowledgements
Data for this paper were gathered within the I+D research project
Hacia el dominio experto de la lengua: estudio comparado del
desarrollo del repertorio lingüístico nativo y no nativo en castellano y
catalán ‘Becoming an expert user of language: a comparative study of
the development of native and non-native linguistic repertoire in
Spanish and Catalan’ funded by the Spanish Ministry of Science and
Innovation, reference code EDU2009-08862.
References
AnCora-CA corpus. <http://clic.ub.edu/corpus/ancora/>
Aparici, Melina 2010. El desarrollo de la conectividad discursiva en
diferentes géneros y modalidades de producción [The
Development of Discursive Connectivity in Distinct Genres and
Modalities of Production] (Unpublished PhD dissertation).
Barcelona: Universitat de Barcelona.
Berman, Ruth A. / Slobin, Dan 1994. Relating Events in Narrative: A
Crosslinguistic Developmental Study. Mahwah: Lawrence
Erlbaum Associates.
Berman, Ruth A. / Verhoeven, Ludo 2002. Cross-linguistic
Perspectives on the Development of Text Production Abilities:
Speech and Writing. Written Language and Literacy. 5/1, 1-44.
Boix, Emili / Alturo, Núria / Perea, Maria P. 2002. Corpus del Català
Contemporani de la Universitat de Barcelona (CUB). A General
Presentation. Romanistische Korpuslinguistik. Korpora und
gesprochene Sprache. Tübingen: Gunter Narr, 155-170.
<http://www.ub.edu/cccub/>.
CesCa corpus. <http://clic.ub.edu/corpus/corpus/cesca/>
CICA corpus. <http://www.cica.cat/>
Civit, Montserrat 2003. Criterios de desambiguación morfosintáctica
de corpus en español [Morphosyntactic Disambiguation
Criteria of Corpora in Spanish]. Alicante: Sociedad Española
para el Procesamiento del Lenguaje Natural.
CTILC corpus. <http://ctilc.iec.cat/>
Cutillas, Laia In press. Parlar per escriure o escriure per parlar? [Talk
for Writing or Write for Talking?]. In L. Tolchinsky (ed.) Cap a
una explotació didàctica dels corpus lingüístics [Towards an
Educational Exploitation of Linguistic Corpora]. Barcelona:
Horsori.
EAGLES 1996. Recommendations for the Morphosyntactic
Annotation of Corpora.
<http://www.ilc.cnr.it/EAGLES96/annotate/annotate.html>.
Granger, Sylviane / Wynne, Martin 2000. Optimising Measures of
Lexical Variation in EFL Learner Corpora. In Granger,
Sylviane / Wynne, Martin (eds) Corpora Galore: Analyses and
Techniques in Describing English. Amsterdam: Rodopi, 249-257.
Llauradó, Anna / Tolchinsky, Liliana In press. The Growth of the
Text-embedded Lexicon in Catalan from Childhood to
Adolescence. First Language.
Llauradó, Anna / Martí, Maria Antònia / Tolchinsky, Liliana 2012.
Corpus CesCa: Compiling a Corpus of Written Catalan
Produced by School Children. International Journal of Corpus
Linguistics. 17/3, 428-441.
MacWhinney, Brian 2012. The CHAT Transcription Format.
<http://childes.psy.cmu.edu/manuals/chat.pdf>.
Malvern, David / Richards, Brian / Chipere, Ngoni / Durán, Pilar
2004. Lexical Diversity and Language Development.
Quantification and Assessment. Hampshire: Palgrave
MacMillan.
Nesselhauf, Nadja 2004. Learner Corpora and their Potential for
Language Teaching. In J. Sinclair (ed.) How to Use Corpora in
Language Teaching. Amsterdam: John Benjamins, 125-152.
Rafel, Joaquim 1998. Diccionari de freqüències. [Frequency List].
Barcelona: Institut d'Estudis Catalans.
Rafel, Joaquim / Solanellas, Josep M. 1986. El corpus textual
automatizat de la llengua catalana. Actas de las II Jornadas
Españolas de Documentación Automatizada. 147-161.
Read, John 2000. Assessing Vocabulary. Cambridge: Cambridge
University Press.
Serrat, Elisabet / Sanz-Torrent, Mònica / Bel, Aurora 2004.
Aprendizaje léxico y desarrollo de la gramática: Vocabulario
verbal, aceleración morfológica y complejidad sintáctica
[Lexical Learning and Grammar Development: Morphological
Productivity, Syntactic Complexity, and Verb Vocabulary
Learning]. Anuario de Psicología. 35/2, 221-234.
Strömqvist, Sven / Johansson, Victoria / Kriz, Sarah / Ragnarsdóttir,
Hrafnhildur / Aisenman, Ravid / Ravid, Dorit 2002. Toward a
Cross-linguistic Comparison of Lexical Quanta in Speech and
Writing. Written Language and Literacy. 5/1, 45-67.
Taulé, Mariona / Martí, Maria A. / Recasens, Marta 2008. AnCora:
Multilevel Annotated Corpora for Catalan and Spanish.
Proceedings of 6th International Conference on Language
Resources and Evaluation, 96-101.
Tolchinsky, Liliana 2004. The Nature and Scope of Later Language
Development. In R. Berman (ed.) Language Development
across Childhood and Adolescence. Amsterdam: John
Benjamins, 233-248.
Torruella, Joan 2009. Los ejes principales en el diseño de un corpus
diacrónico: El caso del CICA [The Principal Axes in the Design
of a Diachronic Corpus: the case of CICA]. In Cantos, Pascual /
Sánchez, Aquilino (eds) A Survey on Corpus-based Research /
Panorama de Investigaciones Basadas en Corpus. Asociación
Española de Lingüística del Corpus, 21-36.
... It is widely used in corpus linguistics research as a way of gauging lexical sophistication (Riedemann, 1996); it reflects both the advanced use of sophisticated, precise, low-frequency terms (Biber, 1995) and an increased command of structurally complex derivatives (Anglin, 1993). Word length shows a clear developmental pattern and is also affected by text genre: as a function of age, the length of words has been shown to increase (Llauradó & Tolchinsky, 2013) and, when compared with narratives, expository texts show significantly longer words (Cutillas et al., 2014). ...
... It is usually assessed as a type-token ratio (TTR), that is, the ratio obtained by dividing the number of different words (types) by the total number of words (tokens) (Bates et al., 1988). Both measures increase as a function of age (Berman & Nir-Sagiv, 2010;Malvern et al., 2004;McCarthy & Jarvis, 2010) and have shown higher indexes in expository than in narrative texts (Cutillas et al., 2014). ...
... In addition, they also tended to use more complex morphological words, such as nominalizations, and more lexical items that help increase the precision of statements and introduce nuances in discourse, such as adjectives, in line with previous research on the development of lexical choices in writing (Berman & Nir-Sagiv, 2009;Crossley et al., 2011;McNamara et al., 2010;Nippold, 2007). The only exception was lexical density, which, in line with Llauradó and Tolchinsky (2013) but contrary to Cutillas et al. (2014), showed no developmental changes. Lexical density has been used as a measure of lexical development in different languages (mostly English, but also Catalan, French, Hebrew and Swedish) with uneven findings, which might point to language-specific trends (Johansson, 2008). ...
Article
Mastering analytical writing involves the proficient use of varied later-acquired grammatical, lexical and discourse forms and functions. Developmental studies have identified specific linguistic features as diagnostic of increasing proficiency. This study examines how these features change throughout educational levels and before and after the implementation of a set of classroom activities aimed at homogenizing participants’ pedagogical input while raising their awareness of analytical texts’ main functions. Two hundred and twelve Spanish speakers from primary, secondary and university levels participate in the study, each producing two analytical texts. The corpus consists of 424 texts produced on the same topic. Results indicate that not all the features pertaining to the same domain show identical changes across educational levels or are identically sensitive to pedagogical work. Productivity measures increase with educational level and after pedagogical work, except for university students. Most lexical, syntactic and discourse-level measures improve with educational level, though their sensitivity to pedagogical work is not straightforward. Findings suggest the need to evaluate writing quality at a high level of granularity.
... The use of low-frequency adjectives should increase with age (Biber, 1995;Ravid & Berman, 2009), because low-frequency terms are typically morphologically complex words. Based on previous studies in Catalan, word length (adjective length in this study) provides a reliable proxy for morphological complexity (Cutillas, Tochinsky, Rosado, & Perera, 2014;Llauradó & Tolchinsky, 2013). ...
... However, unlike a previous study on the same corpus that found a genre effect on lexical word length (Cutillas et al., 2014), no genre effect emerged in adjective length: ...
... Denominal adjectives are a late-emerging category showing the greatest increase from sixth form onwards(Ravid & Levie, 2010) ConclusionsLexical usage is sensitive to communicative purposes and circumstances. Previous studies in various languages have demonstrated that indicators such as lexical diversity, lexical density or word length are significantly affected by age, genre and/or MoP, proving that language users adapt their vocabulary to different text types(Berman & Verhoeven, 2002;Cutillas et al., 2014;Llauradó & Tolchinsky, 2013;Strömqvist, Johansson, Kriz, Ragnarsdóttir, Aisenman, & Ravid, 2002). ...
Preprint
Full-text available
Adjectives, like nouns and verbs, are one of the three major classes of lexical words. But, unlike nouns and verbs, they emerge late in acquisition. In Catalan, as in many other languages, their use is closely linked to the literate lexicon learned at school-age. Thus, the use of adjectives can be a good indicator of later language development. The goal of this study is twofold: to characterize the use of adjectives from age nine to adulthood and to examine the effect of discourse genre (expository and narrative) and mode of production (spoken and written) on frequency of use and word-internal morphological structure. The study takes a corpus-based approach and uses the GRERLI-CAT1 corpus, which contains 316 expository and narrative spoken and written texts produced by 79 Spanish/Catalan bilinguals whose home language is Catalan, at four age and schooling levels (primary school (9-to 10-year-olds), secondary school (12-to 13-year-olds), sixth form (16-to 18-year-olds) and university (adults)). Results show that the use of adjectives expands through school-age and First Language 0(0) especially from sixth form onwards, presenting an increasing pattern. An effect of genre and mode of production on the target features was also detected. Expository texts contain significantly more adjectives per text and clause and lower-frequency adjectives than narrative texts. Written texts contain significantly more adjectives, and lower-frequency and longer adjectives, than spoken texts. Age interacts with mode of production in the use and morphological complexity of adjectives. The four text types analysed (spoken expository, written expository, spoken narrative and written narrative) present a complexity cline, from written expository texts to spoken narratives through spoken expository texts and written narratives.
... L'ús d'adjectius de baixa freqüència hauria d'augmentar amb l'edat (Biber, 1995;Ravid & Berman, 2009), perquè els termes poc freqüents acostumen a ser paraules morfològicament complexes. Segons estudis previs en català, la llargada de la paraula (la llargada de l'adjectiu en aquest estudi) és un indicador fiable de la complexitat morfològica (Cutillas, Tochinsky, Rosado, & Perera, 2014;Llauradó & Tolchinsky, 2013). ...
... La influència de l'alfabetització i una major experiència amb el discurs acadèmic poden jugar un paper crucial en l'augment de la complexitat (Tolchinsky, 2004;Anglin, 1993). Tot i això, a diferència d'un estudi previ sobre el mateix corpus en què es va trobar un efecte del gènere en la llargada de paraula lèxica (Cutillas et al., 2014), aquest cas el gènere discursiu no té efecte significatiu en la llargada de l'adjectiu: només la modalitat de producció l'afecta significativament. Com havíem predit, els textos escrits tenen adjectius més llargs que els orals. ...
... Estudis previs en diferents llengües han demostrat que indicadors com ara la diversitat lèxica, la densitat lèxica o la llargada de paraula es veuen afectats significativament per l'edat, el gènere i/o la modalitat, demostrant que els parlants/escriptors adapten el seu vocabulari a diferents tipus de text (Berman & Verhoeven, 2002;Cutillas et al., 2014;Llauradó, Martí, & Tolchinsky, 2012;Strömqvist et al., 2002). ...
Thesis
Full-text available
Esdevenir parlants competents d’una llengua és un dels desenvolupaments més notables del nostre creixement com a persones. En aquesta tesi ens proposem estudiar l’adquisició del català com a primera llengua (L1), fixant-nos en el desenvolupament d’una categoria lèxica – l’adjectiu – al llarg de l’escolaritat i fins a l’edat adulta, en diferents gèneres discursius (expositiu i narratiu) i modalitats de producció (oral i escrita). Els adjectius són un dels recursos descriptius més potents de què disposa la llengua, l’aprenentatge i el perfeccionament del seu ús és un procés llarg que s’estén més enllà de l’adolescència i està íntimament lligat a l’escolaritat i l’alfabetització, per la qual cosa l’ús dels adjectius pot ser un bon indicador del desenvolupament del lèxic. Durant l’última dècada, diversos estudis han analitzat el creixement i la composició de la categoria adjectiu al llarg de l’etapa escolar, principalment en anglès (Marineille & Kneile, 2012; Nippold & Sun, 2008) i hebreu (Ravid & Levie, 2010). Estudis duts a terme en català també han mostrat els adjectius com una de les categories que més creixement experimenta amb l’edat (Cordero, 2002; Gili Gaya, 1972; Llauradó & Tolchinsky, 2013). El present treball és un estudi de l’ús dels adjectius des d’una aproximació corpus- based. Aplicarem aquesta metodologia a l’anàlisi del desenvolupament tardà del català. L’estudi es basa en el corpus GRERLI-CAT1, el qual consta de 316 textos produïts per 79 informants bilingües català/castellà, que tenen el català com a llengua d’ús habitual. Els informants pertanyen a diferents nivells educatius: Primària (9-10 anys), Secundària (12-13 anys), Batxillerat (16-18 anys) i Universitat (majors de 19 anys). Cada informant va produir quatre textos sobre els conflictes escolars: un expositiu oral, un expositiu escrit, un narratiu oral i un narratiu escrit. 
Compararem els textos, doncs, tenint en compte aquestes variables independents: l’edat/nivell educatiu, el gènere discursiu i la modalitat de producció. Els objectius generals són tres. En primer lloc, volem identificar quines mesures lèxiques serveixen per a diagnosticar el desenvolupament del lèxic i la diferenciació entre gèneres i modalitats. En segon lloc, volem caracteritzar l’ús dels adjectius segons l’edat i el tipus de text. Finalment, volem identificar quines característiques morfològiques i de freqüència d’ús dels adjectius expliquen millor els judicis dels parlants sobre la seva composicionalitat. Per a complir aquests objectius hem dut a terme tres estudis. El primer centrat en analitzar la diversitat i densitat lèxica, la llargada de paraula i la productivitat verbal, i la influència que hi tenen les variables independents del corpus. El segon encaminat a descriure l’ús dels adjectius segons l’edat, el gènere discursiu i la modalitat de producció; analitzant la quantitat, la freqüència i l’estructura morfològica interna dels adjectius. I el tercer dedicat a esbrinar de quina manera els parlants accedeixen al seu coneixement morfològic a l’hora de jutjar la composicionalitat dels adjectius extrets del corpus GRERLI-CAT1. Els resultats del primer estudi indiquen que la llargada de paraula és el millor diagnòstic del desenvolupament lèxic i de la diferenciació entre gèneres i modalitats. La diversitat i la densitat lèxiques també són bons indicadors evolutius. D’aquestes dues mesures, la primera serveix per a diferenciar entre modalitats i la segona és útil per a fer-ho entre gèneres. En canvi, la productivitat verbal no és apropiada per al nostre objectiu. Els resultats del segon estudi mostren que l’ús dels adjectius s’expandeix al llarg de l’escolaritat, donat que augmenta amb el nivell educatiu. També s’ha detectat un efecte del gènere discursiu i de la modalitat de producció. 
Els quatre tipus de text analitzats presenten una gradació en la seva complexitat, des dels expositius escrits fins als narratius orals, passant pels expositius orals i els narratius escrits. Els resultats del tercer estudi revelen que els parlants jutgen la composicionalitat dels adjectius en termes d’un contínuum i no pas en termes dicotòmics – adjectius simples vs. complexos. Els predictors més forts de la composicionalitat són la llargada de l’adjectiu, el número de morfemes comptats des del punt de vista sincrònic, la presència d’un prefix productiu i que la base sigui més freqüent que l’adjectiu derivat. En resum, l’anàlisi quantitativa del lèxic mostra que, a mesura que avancen pels nivells educatius, els informants utilitzen un lèxic més complex en els textos expositius i en la modalitat escrita. L’examen de l’ús dels adjectius ens ha permès obtenir una visió evolutiva de la riquesa lèxica i de com s’adapta el vocabulari usat a cada tipus de text. L’estudi sobre la composicionalitat dels adjectius indica que la complexitat morfològica afecta als usos lèxics, però que és un coneixement implícit dels parlants. Així doncs, el present estudi proveeix a la comunitat acadèmica d’una descripció acurada de l’ús dels adjectius al llarg del desenvolupament tardà del català. Disponible a: http://www.tdx.cat/handle/10803/458445 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Becoming proficient speakers of a language is one of the most prominent developments in our growth as people. 
This thesis studies the acquisition of Catalan as a first language (L1), focusing on the development of one lexical category, the adjective, from school age to adulthood, across discourse genres (expository and narrative) and modes of production (spoken and written). Adjectives are among the most powerful descriptive devices a language has; learning and refining their use is a long process that extends beyond adolescence and is closely related to schooling and literacy. The use of adjectives is therefore a good indicator of lexical development. Over the last decade, several studies have analysed the growth and composition of the adjective category through the school years, mainly in English (Marinellie & Kneile, 2012; Nippold & Sun, 2008) and Hebrew (Ravid & Levie, 2010). Studies carried out in Catalan have also shown that adjectives are among the categories that increase most with age (Cordero, 2002; Gili Gaya, 1972; Llauradó & Tolchinsky, 2013). This thesis takes a corpus-based approach to the study of adjective use. We apply this method to the analysis of later Catalan language development using the GRERLI-CAT1 corpus. The corpus contains 316 texts produced by 79 Spanish/Catalan bilinguals whose home language is Catalan. Subjects came from different schooling levels: Primary School (9- to 10-year-olds), Secondary School (12- to 13-year-olds), Sixth Form (16- to 18-year-olds) and University (over 19). Every subject produced four texts about school conflicts: spoken expository, written expository, spoken narrative and written narrative. We compare these texts taking into account the following independent variables: schooling level, discourse genre and mode of production. There are three general objectives. The first is to identify which lexical measures are useful for diagnosing lexical development and the differences between discourse genres and modes of production.
The second is to characterize the use of adjectives according to schooling level and text type. The third is to identify which morphological features and frequency-of-use characteristics of adjectives best explain speakers’ judgements about their compositionality. To achieve these objectives we carried out three studies. The first analyses lexical diversity, lexical density, word length and productivity of verbs, and the influence of the corpus’s independent variables on them. The second describes the use of adjectives according to schooling level, discourse genre and mode of production, analysing the number, frequency of use and word-internal morphological structure of adjectives. The third aims to find out how speakers access their morphological knowledge when judging the compositionality of adjectives extracted from the GRERLI-CAT1 corpus. The results of the first study show that word length is the best diagnostic of lexical development and of the differentiation between genres and modes. Lexical diversity and lexical density are also good developmental indicators: the former differentiates between modes of production and the latter between discourse genres. Productivity of verbs, however, is not appropriate for our objective. The results of the second study show that the use of adjectives expands throughout the school years, as it increases with schooling level. An effect of discourse genre and mode of production was also detected. The four text types analysed form a complexity cline, from written expository texts to spoken narratives, with spoken expository texts and written narratives in between. The results of the third study reveal that speakers judge the compositionality of adjectives not in dichotomous terms (simple vs. complex adjectives) but along a continuum.
The strongest predictors of compositionality are adjective length, the number of morphemes counted from a synchronic point of view, the presence of a productive prefix, and relative frequency, that is, whether the base is more frequent than the derived adjective. To sum up, the quantitative analysis of the lexicon shows that, as subjects progress through the school levels, they use a more complex lexicon in expository texts and in the written mode. The analysis of adjective usage allowed us to obtain a developmental view of lexical richness and of the way vocabulary is adapted to each text type. The study of the compositionality of adjectives indicates that morphological complexity affects lexical usage, although it is implicit knowledge on the part of speakers. The present study thus provides the academic community with an accurate description of the use of adjectives throughout later Catalan language development. Available at: http://www.tdx.cat/handle/10803/458445
... A similar study by Cutillas et al. (2014) examined oral and written narrative and expository texts produced in Catalan by seventy-nine Catalan-Spanish bilinguals representing five groups (10-, 13-, and 17-year-olds, as well as university students and teachers). The measures of lexical diversity used in this study included two operationalizations of the type-token ratio: one where types were operationalized as word forms (i.e. ...
... These studies also show that lexical diversity differences across genres can be found in multiple languages and among varying combinations of genres, and can also be found when measuring the lexical diversity of the whole text as well as when restricting the analysis to a particular category of words. A comparison of Johansson (2008) and Cutillas et al. (2014) further demonstrates that sophisticated VR measures that limit the potentially confounding effects of text length appear to be better at capturing the differences between genres than is the type-token ratio. ...
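The text-length effect mentioned in this excerpt can be illustrated with a minimal sketch. It compares the plain type-token ratio with a moving-average type-token ratio, used here only as one example of a length-robust alternative; it is not necessarily the measure used in the studies cited. The function names, the window size and the toy data are assumptions made for illustration.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def moving_average_ttr(tokens, window=50):
    """Mean TTR over every window of `window` consecutive tokens.

    Falls back to the plain TTR when the text is shorter than the window.
    The default window size is an arbitrary illustrative choice.
    """
    if len(tokens) < window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# Toy data: the same five-token sentence, repeated to make a longer text.
short_text = "the cat saw the dog".split()
long_text = short_text * 20

print(type_token_ratio(short_text), type_token_ratio(long_text))
print(moving_average_ttr(short_text, window=5),
      moving_average_ttr(long_text, window=5))
```

In this toy case the plain ratio falls sharply as the text grows even though the vocabulary is unchanged, whereas the windowed average stays the same for both texts.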
Article
Many studies have investigated the linguistic characteristics of television and have found important differences between categories of TV programs. Yet, little is known specifically about the lexical profiles of different genres of television discourse. The present study sought to address this gap by exploring the lexical diversity of 714 episodes representing four TV genres. The lexical diversity of each episode was measured using a six-dimensional model of lexical diversity. Multinomial logistic regression was used to determine whether the four TV genres in the present study have unique lexical diversity profiles and whether the genres of individual TV episodes can be predicted based on the adopted model. The results indicated that the four genres do indeed exhibit unique lexical diversity profiles; it was also found that the genres of individual TV episodes can be predicted with approximately 91% accuracy based on this model. These findings were interpreted as underscoring the relevance of lexical diversity to genre analysis of TV shows and the importance of using a theoretically grounded multivariate model of this construct.
Article
This quantitative correlational study investigates the relationship between lexical diversity (LD) and writing competence in English as a Foreign Language (EFL) among 66 Hungarian and 64 Jordanian university students, balanced for gender and English proficiency and all enrolled in English studies programs. Participants completed a Bilingual Language Profile (BLP), providing information about their English use, proficiency, and attitudes, and then wrote narratives after watching a silent film. Lexical diversity was measured through VocD and MTLD, computed with Text Inspector (an online tool). Writing competence was evaluated using a holistic scoring approach. The study revealed a significant positive correlation between writing competence scores and the LD measures VocD and MTLD in both groups, with slightly higher scores for Hungarians. The Hungarian group, predominantly at the C1 level, showed higher proficiency than the Jordanian group, predominantly at the B2 level. Although gender differences in LD and writing competence were observed, they were not statistically significant. This research highlights practical implications for educators, emphasizing the correlation between writing competence and vocabulary knowledge in an EFL context.
Chapter
In this chapter, we present an original model for the measurement of “individual lexical breadth”. With this construct, we try to represent lexical richness based on a solid and non-intuitive foundation, leading to the understanding that a speaker’s lexical breadth is the quantitative reflection of his or her degree of lexical richness. Thus, lexical breadth is a statistically proven quantitative measure of lexical richness. The model presented here is based on both the size and specificity of the vocabulary produced by a speaker during a lexical test, i.e., on their available lexicon. A principal difference between our measure of individual lexical breadth and others relating to lexical richness is that our measure reflects a characteristic of the speakers and not of the written production they produce. We propose a number of indexes based on our model, notably, the Index of Lexical Decentralization (ILD) and the Index of Lexical Amplitude (ILA). Our intention is for these indexes to be used in addition to the four above-mentioned parameters (i.e., content density, lexical sophistication, lexical accuracy and lexical diversity) in the calculation of lexical richness. Our model involves a revised interpretation of the traditional concept of “lexical availability”, while still using association tests as source of data. It is based on previous work in the fields of socio- and psycholinguistic research (see Ávila-Muñoz 2016) in which association tests proved useful, despite the simplicity with which they allowed linguistic information to be obtained (López-Morales 1999; Hernández, Izura, and Ellis 2006; Ávila-Muñoz and Villena-Ponsoda 2010: 177–279).
Article
The development of writing is pivotal in the educational and communicative growth from childhood to adulthood. This special issue harnesses studies that examine the development of text writing in different discourse genres and languages, showcasing a set of text-embedded features that function as the blueprint for children’s evolving writing ability. These studies focus on both word-level and text-level features; a range of discourse genres; different language domains marked as indicators of writing quality; explanatory factors of writing quality; and a crosslinguistic scope encompassing language typology, the developmental span of writing and writing difficulties. The multidimensional perspective taken in this themed issue addresses the particular and universal aspects underlying the development of writing.
Chapter
A l’escola, una de les tasques més habituals a l’hora de parlar d’un tema és escriure una redacció o presentar una exposició oral. Tot i això, aquestes dues activitats gairebé sempre són exclusives: o es fa oralment o es fa per escrit, però no és habitual fer la mateixa activitat en les dues modalitats de producció; sembla com si posar per escrit una exposició que ja ha estat presentada oralment o exposar oralment un text escrit no aportés res de nou a l’aprenentatge dels nens. És per això que en aquest treball volem estudiar les diferències observables entre els textos que s’han fet primer de manera oral i després per escrit, i a la inversa; per a veure de quina manera influeix l’ordre de la modalitat de producció en la qualitat dels textos resultants: quins són més llargs? Quins tenen un lèxic més ric i variat? En definitiva, volem esbrinar quins textos són millors, els que s’han fet primer oralment i després per escrit, o els que s’han produït primer per escrit i després s’han exposat oralment. El corpus utilitzat per al present estudi està format per textos produïts per parlants de català com a primera llengua que realitzen quatre tipus de textos en diferents ordres de realització, segons el gènere discursiu i la modalitat de producció: expositiu oral, expositiu escrit, narratiu oral i narratiu escrit. Les característiques que estudiem són la diversitat lèxica, que és el percentatge de paraules diferents sobre el total de paraules d’un text; la llargada mitjana de paraula, mesura que permet veure si les paraules usades són morfològicament simples o complexes, i la densitat lèxica, és a dir, la proporció entre les paraules lèxiques (noms, adjectius, verbs...) i les funcionals (determinants, preposicions, pronoms...). 
Un cop comparades aquestes característiques dels textos, discutim quin ordre de modalitat de producció influeix positivament en els textos resultants, per així donar pistes als docents sobre com els nens poden millorar les seves habilitats discursives. [One of the most common tasks when dealing with a topic at school is to write an essay or to give an oral presentation. However, these two activities are almost always mutually exclusive: the task is done either orally or in writing. It is unusual to do the same activity in both modalities of production, as if writing up a presentation that has already been given orally, or presenting a written text orally, added nothing new to children’s learning. For this reason, in this paper we study the observable differences between texts that were produced first orally and then in writing, and vice versa, to see how the order of the modalities of production affects the quality of the resulting texts: Which are longer? Which have a richer and more varied vocabulary? In sum, we want to find out which texts are better: those produced first orally and then in writing, or those produced first in writing and then presented orally. The corpus used for this study consists of a set of texts produced by bilingual Spanish/Catalan speakers who have Catalan as their home language. They produced four types of texts in different production orders, according to discourse genre and modality of production: spoken expository, written expository, spoken narrative, and written narrative. The characteristics studied are lexical diversity, which is the percentage of different words relative to the total number of words in a text; mean word length, which can show whether the words used are morphologically simple or complex; and lexical density, i.e. the proportion between lexical words (nouns, adjectives, verbs...) and function words (determiners, prepositions, pronouns...).
Once these characteristics of the texts have been compared, we discuss which production order has a positive impact on the resulting texts, so as to give teachers some pointers on how children can improve their discursive skills.]
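The three measures defined in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' procedure: the input is assumed to be already tokenized and part-of-speech tagged, the tag labels are hypothetical, adverbs are counted as lexical words by assumption, and lexical density is computed as lexical words over all words, which is only one reading of "proportion".

```python
# Hypothetical coarse tag set; a real tagger's labels would need mapping.
# Counting adverbs as lexical words is an assumption of this sketch.
LEXICAL_TAGS = {"NOUN", "ADJ", "VERB", "ADV"}


def lexical_profile(tagged_tokens):
    """Compute three lexical measures for a list of (word, tag) pairs."""
    words = [word.lower() for word, _ in tagged_tokens]
    total = len(words)
    if total == 0:
        return {"diversity_pct": 0.0, "mean_word_length": 0.0,
                "lexical_density": 0.0}
    lexical = sum(1 for _, tag in tagged_tokens if tag in LEXICAL_TAGS)
    return {
        # Percentage of different word forms over all words in the text.
        "diversity_pct": 100 * len(set(words)) / total,
        # Mean number of characters per word.
        "mean_word_length": sum(len(word) for word in words) / total,
        # Share of lexical words among all words (assumed definition).
        "lexical_density": lexical / total,
    }


# Invented example sentence, tagged by hand for illustration.
sample = [
    ("els", "DET"), ("nens", "NOUN"), ("escriuen", "VERB"),
    ("textos", "NOUN"), ("i", "CONJ"), ("els", "PRON"),
    ("llegeixen", "VERB"),
]
profile = lexical_profile(sample)
print(profile)
```

Because the diversity percentage depends on text length, comparisons across texts of very different lengths would need some form of length control.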
Article
This paper outlines the compilation of a corpus of Catalan written production. The CesCa corpus presents a picture of the Catalan written language throughout compulsory schooling. It contains two kinds of data: Vocabularies of five semantic fields comprising 242,404 lexical forms and Textual data of four different discourse genres consisting of 207,028 tokens. Both vocabularies and the textual data have been morphologically analyzed and lemmatized. The corpus is freely available. This paper will outline the main features of the corpus and make some suggestions as to the uses to which the corpus can be put.
Article
The present study presents contrastive analyses of task-oriented spoken and written discourse in terms of lexical diversity, lexical density, and word length. In an age-matched within-language comparison (Swedish), written discourse consistently scored higher on these measures. It is suggested that the same type of differences will hold for any language, because of the difference between speech and writing in processing constraints. The absolute scores, however, can vary substantially for reasons of language typology. An extended, cross-linguistic analysis (English, Hebrew, Icelandic, Swedish), focusing on word length, was made to substantiate that claim. Further, cross-age-group comparisons of lexical quanta indicated a dynamic interaction between speech and writing in development. Spoken discourse eventually comes to “learn” from the development of writing.
Article
The studies reported in this volume of WL&L (5: 1–2, 2002) all derive from a joint project entitled “Developing literacy in different contexts and in different languages”, funded by the Spencer Foundation, Chicago. The study encompasses seven languages — Dutch, English, French, Hebrew, Icelandic, Spanish, and Swedish — for which data were collected in Europe, Israel, and the US by graduate research assistants in education, linguistics, and psychology, under the supervision of a project director in each country — each of whom is listed as a first or “lead” author in the articles which follow the introduction to this collection.
Article
Lexical development is a key facet of later language development. To characterize the linguistic knowledge of school age children, performance in the written modality must also be considered. This study tracks the growth of written text-embedded lexicon in Catalan-speaking children and adolescents. Participants (N = 2161), aged from 5 to 16 years produced six different texts: a film explanation, a film recommendation, a joke telling and definitions of a noun, a verb and an adjective. The resultant corpus of 11,332 texts was analyzed using four distributional measures of lexical development: word length, lexical density, use of adjectives and nominalizations. Heylighen’s F-measure of level of text formality was also computed. Word length, use of adjectives and nominalizations were powerful indicators of lexical development. Text type and home language had an effect on these measures. Lexical density showed no clear developmental change, and did not vary by type of text. Heylighen’s F-measure was a weaker developmental indicator. Educational implications are discussed.
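For readers unfamiliar with Heylighen’s F-measure, the following minimal sketch shows the commonly cited formula: the summed percentage frequencies of nouns, adjectives, prepositions and articles, minus those of pronouns, verbs, adverbs and interjections, plus 100, all divided by 2. The tag labels and the function name are hypothetical, and the abstract does not say how the study implemented the computation.

```python
from collections import Counter

# Hypothetical tag labels; a real tagger's tag set would need mapping.
FORMAL_TAGS = ("NOUN", "ADJ", "PREP", "ART")
DEICTIC_TAGS = ("PRON", "VERB", "ADV", "INTJ")


def f_measure(tags):
    """Formality score from a sequence of part-of-speech tags.

    Frequencies are percentages of all tokens, including tokens whose
    category appears in neither group (e.g. conjunctions).
    """
    if not tags:
        raise ValueError("cannot score an empty text")
    counts = Counter(tags)
    total = len(tags)

    def pct(tag):
        return 100 * counts[tag] / total

    formal = sum(pct(tag) for tag in FORMAL_TAGS)
    deictic = sum(pct(tag) for tag in DEICTIC_TAGS)
    return (formal - deictic + 100) / 2


# Invented ten-token tag sequence for illustration.
example_tags = ["ART", "NOUN", "VERB", "ART", "ADJ",
                "NOUN", "PREP", "PRON", "ADV", "CONJ"]
score = f_measure(example_tags)
print(score)
```

The score ranges from 0 to 100, with higher values indicating a more formal, noun-heavy style.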