ChapterPDF Available

Indicators of lexical growth throughout age, genre and modality for a Catalan L1 corpus

November 2014

November 2014

In book: Specialization and variation in language corpora (pp.161-188)
Chapter: Indicators of lexical growth throughout age, genre and modality for a Catalan L1 corpus
Publisher: Peter Lang
Editors: A. Díaz Negrillo, F. Díaz Pérez

Authors:

Laia Cutillas i Alberich

Universitat de Lleida

Liliana Tolchinsky

University of Barcelona

Elisa Rosado

University of Barcelona

Lexical development is a key facet of later, school-age language development. The goal of the study is twofold: on the one hand, to describe quantitatively the text-embedded lexicon of a corpus of texts produced by informants from nine years of age to adulthood, and on the other hand, to identify which characteristics of the lexicon can be considered as indicators of age, discourse genre, and modality of production differentiation. The GRERLI-CAT1 (Grup de Recerca per a l’Estudi del Repertori Lingüístic, Català L1) corpus is constituted by a set of narrative and expository spoken and written texts in Catalan that were produced by 79 bilingual Catalan/Spanish speakers that have Catalan as their home language. They were distributed in five groups according to level of education: elementary, secondary, high school and university level, and language teachers. The corpus comprises 316 texts, which include a total of 84,081 tokens, 40,612 types and 31,811 lemmas. Four measures were applied for characterizing the corpus lexically: (1) lexical diversity, (2) lexical density, (3) word length, and (4) productivity of verbs. Lexical diversity was selected to gauge the range of vocabulary displayed in the texts, lexical density as an indicator of textual richness and informativeness, word length was taken as an indicator of lexical complexity, and productivity of verbs as an indicator of lexical richness within the verbal domain. We discuss the utility of these four measures as indicators of text construction development in different genres and modalities.

Type-token-lemma ratios by text type.

…

Type-token-lemma ratios by age group.

…

Word length by text type.

…

Word length by age group.

…

Figures - uploaded by Laia Cutillas i Alberich

Content may be subject to copyright.

Content uploaded by Laia Cutillas i Alberich

Content may be subject to copyright.

LAIA CUTILLAS, LILIANA TOLCHINSKY, ELISA ROSADO, JOAN

PERERA

Indicators of lexical growth throughout age, genre

and modality for a Catalan L1 corpus

Abstract

Lexical development is a key facet of later, school-age language

development. The goal of the study is twofold: on the one hand, to

describe quantitatively the text-embedded lexicon of a corpus of texts

produced by informants from nine years of age to adulthood, and on

the other hand, to identify which characteristics of the lexicon can be

considered as indicators of age, discourse genre, and modality of

production differentiation.

The GRERLI-CAT1 (Grup de Recerca per a l’Estudi del

Repertori Lingüístic, Català L1) corpus is constituted by a set of

narrative and expository spoken and written texts in Catalan that were

produced by 79 bilingual Catalan/Spanish speakers that have Catalan as

their home language. They were distributed in five groups according to

level of education: elementary, secondary, high school and university

level, and language teachers. The corpus comprises 316 texts, which

include a total of 84,081 tokens, 40,612 types and 31,811 lemmas.

Four measures were applied for characterizing the corpus

lexically: (1) lexical diversity, (2) lexical density, (3) word length, and

(4) productivity of verbs. Lexical diversity was selected to gauge the

range of vocabulary displayed in the texts, lexical density as an

indicator of textual richness and informativeness, word length was

taken as an indicator of lexical complexity, and productivity of verbs

as an indicator of lexical richness within the verbal domain. We

discuss the utility of these four measures as indicators of text

construction development in different genres and modalities.

160 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

1. Introduction

The general framework of this study is Corpus Linguistics, which

approaches the study of language empirically. It is based on authentic

samples of language use, so data treatment is external to the speaker,

directly observable and, therefore, demonstrable. It uses computerized

corpora, which represent an important source of quantitative

information. The use of these samples enables to determine frequency

of occurrences and lack of specific linguistic elements (Civit 2003: 2).

Corpus linguistics is a method applicable to every area of linguistic

research, from phonology to discourse. Moreover, corpus-based

research is applicable to linguistic education both in first (L1) and

second (L2) language learning. Native corpora, like the one used in

this study, can be very useful for education professionals because they

show what speakers say or write in a specific communication

situation, as well as their typical difficulties (Nesselhauf 2004: 144-

145). In this study, this methodology is applied to first language

acquisition and, more specifically, to the later, school age language

development of Catalan. Catalan is a Romance language that typically

displays a rich morphological system: nouns and adjectives are

inflected for gender and number; pronouns are inflected for gender,

number, person and case; and verbs are inflected for person/number,

tense/aspect and mood. Moreover, it requires gender and number

agreement between noun and adjective and between pronoun and its

antecedent, as well as person and number agreement between subject

and verb.1

1 Catalan is spoken in four Spanish autonomous communities (Aragon, the

Balearic Islands, Catalonia and the Valencian Community), in the French re-

gion of Rousillon, and in the city of Alghero, on the Italian island of Sardinia.

In Catalonia, Catalan and Spanish are equally recognized as official language,

but Catalan is the language of education. Therefore, children educated in

Catalonia are bilingual (Catalan/Spanish) living in a multilingual environ-

ment: “Due to a major surge of immigration over the past decade (3% in 2000

to 13% in 2008), an increasing percentage of children speak a language differ-

ent from Catalan and Spanish at home” (Llauradó / Tolchinsky in press).

Indicators of lexical growth for a Catalan L1 corpus 161

Several Catalan corpora have been compiled during the recent

years. The CCCUB (Corpus del Català Contemporani de la UB)

(Boix et al. 2002), compiled by the Grup d’Estudi de la Variació

(GEV) at the Universitat de Barcelona (UB), contains texts in

different dialects and registers, but only in one modality of production,

the spoken one. The AnCora-CA (Anotated Corpus-Catalan) (Taulé et

al. 2008) includes written journalistic texts, so it could only be

assessed one genre and one modality. The CesCa (Català Escolar

Escrit a Catalunya) (Llauradó et al. 2012) contains different types of

written texts: narrations of a film storyline, recommendations of a

film, or definitions of words and jokes, but only in one modality of

production. These two latter corpora are compiled by the Centre de

Llenguatge i Computació (CLiC) of the UB. The CICA (Corpus

Informatitzat del Català Antic) (Torruella 2009), compiled by the

Grup de Lexicografia i Diacronia (SFI) of the Universitat Autònoma

de Barcelona (UAB), is a diachronic corpus which includes books

written from the 12th to the 15th centuries. The CTILC (Corpus

Textual Informatitzat de la Llengua Catalana) (Rafel / Solanellas

1986), compiled by the Institut d’Estudis Catalans (IEC) with

lexicographic purposes, contains both literary and non- literary texts in

different genres but, again, only in the written modality. The

GRERLI-CAT1 corpus, compiled by the GRERLI (Grup de Recerca

per a l’Estudi del Repertori Lingüístic) at the UB, includes texts in

different genres (narrative and expository) and modalities of

production (spoken and written), produced by subjects of different

school levels (elementary, secondary and high school, university level

and language teachers).

The goals of this paper are (1) to analyse quantitatively the

GRERLI-CAT1 corpus text-embedded lexicon and (2) to identify

which characteristics of the lexicon can be considered as indicators of

lexical growth throughout age, genre, and modality. With this

purpose, four measures are evaluated: lexical diversity, lexical

density, word length and productivity of verbs.

Lexical diversity (Malvern et al. 2004: 19) is selected because it

can gauge the range of vocabulary use. Lexical density (Read 2000:

200) is considered a good indicator of textual richness and

informativeness. Word length (Strömqvist et al. 2002: 48), is chosen

162 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

as a measure of lexical complexity. Productivity of verbs is selected as

an indicator of lexical richness within the verbal domain. This

measure embraces both the diversity of verbal lemmas, that is, the

number of different verbal lemmas used in the corpus, and the

productive use of inflectional morphology of verbs, that is, the number

of different tense forms (Serrat et al. 2004: 221).

2. Corpus compilation

The GRERLI-CAT1 corpus was compiled within the framework of a

cross-linguistic project designed to analyse the development of text

construction abilities in different languages. The corpus was compiled

in 1998 as part of the international project “Developing literacy in

different contexts and in different languages”, Spencer Foundation,

Chicago, United States of America (P.I.: R. A. Berman, Tel Aviv

University). The languages that took part in this study were Dutch, US

English, French, Hebrew, Icelandic, Spanish, and Swedish, with the

subsequent addition of Catalan. The main goals of this project were to

understand how speakers/writers develop their discursive abilities

throughout different educational levels, to analyse how they use the

resources of their respective languages to construct discourse in

different genres (expository and narrative) and modalities of

production (spoken and written), and, finally, to find out common and

language-specific patterns of language use when producing spoken

and written texts (Berman 2002; Aparici 2010).

Language usage, both spoken and written, is always framed

within a specific discourse genre. As Tolchinsky (2004: 235) says:

There is no such thing as neutral use of language: people constantly attune

their speech to specific intentions, purposes, and interlocutors. Thus,

development is viewed in terms of the acquisition of different discourse

genres, and the way that the cultural conventions of genres constrain the use

of linguistic forms.

Indicators of lexical growth for a Catalan L1 corpus 163

The two genres selected for the project were expository and narrative.

Several reasons lead to the selection of these two genres. Firstly,

because the narrative genre has been vastly studied from a

developmental psycholinguistic point of view while the expository

genre has not, since it is a late developing genre, and secondly,

because of the contrast in personal involvement that characterises

these two genres. A first person narrative and the discussion of a topic

illustrate the extremes of a continuum of personal involvement.

Personal involvement of the speaker/writer is demonstrated, among

other things, in the election of the main characters. In personal

narratives the main character is the speaker/writer and, in expository

texts, the topic gains prominence (Aparici 2010).

Regarding modalities of production, spoken and written texts,

differ from the point of view of the demands imposed by processing.

The spoken modality is constrained by on-line processing and can

cause mistakes, breaks, repetitions, reformulations or lack of referents.

On the other hand, the written modality is not subject to time

limitations. It enables the writer to plan thoroughly what he or she

wants to write, to outline or to proofread. For this reason, written texts

are prone to be better organized and be more cohesive and coherent

than spoken texts.

Informants were selected from nine years of age onwards

because at this age they have already attained a command of narrative

structure and of the mechanisms of discursive cohesion (Berman /

Slobin 1994). The sample also included adults because the

development of discursive abilities is a long process that spans from

childhood to beyond adolescence (Berman / Verhoeven 2002: 14).

2.1. Informants

The texts were produced by 79 informants from Barcelona, who were

bilingual speakers of Catalan and Spanish and whose home language

164 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

is Catalan.2 Participants were distributed in five groups, according to

their educational level and age. The elementary school group

comprises 20 students from the 4th year elementary school. The

secondary school group comprises 19 students of the 1st year

secondary school. The high school group comprises 20 students of the

1st year high school. The university group comprises 10 students from

different courses and specialities, specifically, four Science students

and six Humanities students. Finally, the teachers group comprises 10

high school Catalan language teachers. This group is expected to

represent a level of professional use of the language.

Table 1 shows data of all informants’ groups:

Elementary

School

Secondary

School

High School

University

Teachers

n = 20

n = 19

n = 20

n = 10

Range

10;3.

9;6. -

10;9.

13;0.

12;4. -

13;4.

17;4.

16;3. -

18;10.

21;9.

19;1. -

24;2.

43;10.

37;8. -

54;11.

Table 1. Number of informants, mean age and age range.

2.2. Tasks

All informants produced four texts. After watching a three-minute

video without text, participants were asked to produce a spoken

expository text, a written expository text, a spoken narrative text and a

written narrative text. The video shows different conflictive situations

in schools, such as fights, shunning classmates, cheating in exams, etc.

The purpose was that all texts produced by the informants had a

common topic to enable the comparison of the text linguistic

characteristics.

2 It is unlikely to find monolingual speakers of Catalan, since Catalan and Span-

ish are both official languages in Catalonia. All children use Catalan at school

and Spanish is massively present both in the media and in social settings.

Indicators of lexical growth for a Catalan L1 corpus 165

2.3. Procedure

Informants produced the four tasks individually. Data were elicited in

two sessions and four different orders of text production were

followed: A (first session: spoken narrative/written narrative; second

session: spoken expository/written expository); B (first session:

written narrative/spoken narrative; second session: written

expository/spoken expository); C (first session: spoken

expository/written expository; second session: spoken

narrative/written narrative); and D (first session: written

expository/spoken expository; second session: written

narrative/spoken narrative).

2.4. Corpus storage

We have used two different versions of the corpus: clean (net) and

morphologically tagged (morfo). In the following sub-sections, the

characteristics of these two corpus versions will be described.

2.4.1. Clean version (net)

This version consists of text files without format which contain the

production with no additional symbols or indications of spelling

mistakes. All marking of ancillary material, such as repetitions,

reformulations, pauses and comments, is omitted. The transcription

unit is the clause. That is, each text line in this version corresponds to

a clause. We follow Berman and Slobin’s (1994: 660) definition of

this unit:

We define a clause as any unit that contains a unified predicate. By unified,

we mean a predicate that expresses a single situation (activity, event, state).

Predicates include finite and non-finite verbs, as well as predicate adjectives.

This version aims at producing an input file for the morphological

analyser HSMorfo, which performs the morphological tagging. The

counts of tokens and types were computed using the net version.

166 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

2.4.2. Morphologically-tagged version (morfo)

This is a text version that results from the net version of the corpus

after having been morphologically tagged using the HSMorfo Linux

software.3 Files of the morphologically analysed texts contain three

columns: the first one represents the type, the second one the lemma

and the third one includes the EAGLES (Expert Advisory Group on

Language Engineering Standards) tag (EAGLES 1996), which shows

the grammatical category of the type. The measures for the four

analysis dimensions of the current paper were calculated using the

morfo version.

2.4.3. Other versions

There are three more versions of the corpus: mirror (rep), normalized

(nor) and CHAT (cha). The mirror version (only for writing texts)

consists in the transcription of the written productions into a MS Word

file, with the original disposition of paragraphs and lines, punctuation

and spelling. In the normalized version, texts (both spoken and

written) are transcribed in CHAT (Codes for the Human Analysis and

Transcripts) format (MacWhinney 2012). Spoken productions are

transcribed orthographically (not phonetically) including processing

information (pauses, repetitions, reformulations, etc.). In written texts,

following CHAT conventions, spelling mistakes are followed by the

correct word (e.g. vastant [: bastant] ‘enough’). Finally, the CHAT

version only differs from the nor version in that the transcription unit

is the clause.

3 It is important to remark that some files have been analysed using the

FreeLing software, because they included a character which provoked a mal-

functioning on HSMorfo and it finished the analysis before arriving to the

end. However, the working method of both programs is the same, because

they use the same analysis process, as well as EAGLES tags on the tagging

process. They only differ in the treatment of the apostrophized words

(FreeLing attaches them to the next word and HSMorfo separates them) and in

the contractions (FreeLing separates them into two words and HSMorfo treats

them as one word). These treatments have been subsequently fixed using a

software specially designed to correct these errors.

Indicators of lexical growth for a Catalan L1 corpus 167

2.5. Corpus processing

The net version provides the input files for the HSMorfo Linux

software, which performs the morphological analysis and tagging, the

morfo version of the corpus. The HSMorfo software uses the tags

developed by the EAGLES group for morphosyntactic annotation of

European languages, which are adapted to Catalan in Civit (2003).

Each position of the tag corresponds to a feature (whose number

varies for each category), the value of this feature and a code

representing them. Zero shows that a given feature has no value.

Below are some examples of the morphological analysis of three

specific cases of segmentation, which could be a problem for

morphological analysers:

• Multi word expressions, linked by an underscore (_), such as punt

de vista ‘point of view’:

punt de vista 

1. punt_de_vista punt_de_vista (‘point of view’) NCMS000

(Noun Common Masculine Singular)

• Words containing a dash (-) and/or an apostrophe (’):

a) Words with clitic pronouns, which are attached to a verb

using a dash and/or an apostrophe, such as controla’ls-ho

‘control it to them’:

controla’ls-ho 

1. controla controlar (‘to control’) VMM02S0 (Verb Main

Imperative 2nd Singular)

2. ’ls ell (‘he’) PP3CP000 (Personal Pronoun 3rd Common

Plural)

3. -ho ho (‘it’) PP3NN000 (Personal Pronoun 3rd Neuter

Invariable)

b) Apostrophized words, such as l’únic ‘the unique’:

l’únic 

1. l’ el (‘the’) DA0CS0 (Determinate Article Common

Singular)

2. únic únic (‘unique’) AQ0MS0 (Adjective Qualifying

Masculine Singular)

168 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

• Contractions (prepositions attached to an article) are treated by the

analyser as compound prepositions, and they include grammatical

information about gender and number. In other words, they

present the preposition features (three first values of the tag) and

the article features (the two last values of the tag), such as dels ‘of

the’:

dels 

1. dels del (‘of the’) SPCMP (Adposition Preposition

Complex Masculine Plural)

As we have seen in this section, the characteristics of the GRERLI-

CAT1 corpus are as follows: it contains 316 texts, which include a

total of 84,081 tokens, 40,612 types and 31,811 lemmas. The main

features that identify this corpus are: (i) Language: the corpus contains

texts in Catalan; (ii) Developmental dimension: the corpus reflects

later, school age language development, because it contains data

produced by informants from nine years old to adulthood; (iii) Genre:

it contains texts in two discourse genres: narrative and expository; (iv)

Modality: it includes texts of two modalities of production: spoken

and written; and finally, (v) Corpus storage: the corpus is stored in

five different versions that can serve as input for different

computational platforms. Moreover, because the corpus is part of a

cross-linguistic project, in which the same data elicitation procedure

was used for the different languages, it can be compared with the

corpora in the other languages that took part in the study.

3. Dimensions of analysis

In order to identify the characteristics of the lexicon that can be

considered as indicators of age, discourse genre and modality of

production differentiation, the selected measures are: (3.1.) lexical

diversity, calculated using type-token, lemma-token and lemma-type

ratios; (3.2.) lexical density, i.e. the proportion of content words

Indicators of lexical growth for a Catalan L1 corpus 169

relative to the total number of words; (3.3.) word length, i.e. the

number of letters of each lexical word; and (3.4.) productivity of

verbs, characterised by the diversity of verbal lemmas used and the

productive use of inflectional morphology of verbs, i.e. the number of

verb tenses associated to each verbal lemma.

3.1. Lexical diversity

Lexical diversity (Malvern et al. 2004) is a widely used measure of

language development, because it can gauge the range of vocabulary

use. Although the type-token ratio is by far the most commonly used

measure of lexical diversity, it is much less useful in measuring

vocabulary of inflectional languages like Catalan than another

measure, the lemma-token ratio, which provides more accurate

information about lexical diversity, as Granger / Wynne (2000: 251)

say:

A learner who uses five different forms of the verb go (go/goes/going/

gone/went) in one and the same text has a less varied vocabulary than the one

who uses five different lemmas (such as go/come/leave/enter/return).

Moreover, a parameter that calculates the correlation between lemmas

and types, the lemma-type ratio, can also be useful to study lexical

diversity of Catalan. The type-token ratio is obtained by dividing the

number of types by the number of tokens, the lemma-token ratio by

dividing the number of lemmas by the number of tokens; and finally,

the lemma-type ratio by dividing the number of lemmas by the

number of types.

3.2. Lexical density

Lexical density (Read 2000) is considered to be a good indicator of

textual richness. If content words convey the vast bulk of semantic

content, then the relative proportion of lexical words used in a text can

provide an idea of its informativeness. Lexical density is obtained by

170 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

dividing the number of lexical words (nouns, main verbs, adjectives

and adverbs) by the total of words in a sample.

3.3. Word length

Word length (Strömqvist et al. 2002) is regarded as an indicator of

lexical complexity. The longer words are supposed to be

derivationally more complex words, so word length is a suitable

measure of lexical complexity. Function words (auxiliary verbs,

determinants, pronouns, prepositions, conjunctions and interjections)

have been removed from this count because of their grammatical

status. Means for word length are obtained by counting the number of

letters in each lexical word over the total number of words.

3.4. Productivity of verbs

Productivity of verbs is seen here as an indicator of lexical richness

within the verbal domain. In the context of the present study we

consider the diversity of verbal lemmas (number of different lexical

verbs that appear in the corpus) apart from the productive use of

the inflectional verb morphology (number of different verb forms –

marking of tense/aspect or mood– for each specific verbal lemma).

Diversity of verbal lemmas is obtained by dividing the total of verbal

lemmas by the number of different verbal lemmas used in a sample.

Productive use of inflectional morphology (Serrat et al. 2004) is

calculated by counting the number of different tenses used for each

verbal lemma. For these counts, differences in person/number

inflection have not been considered, and the non-finite forms of the

verb (infinitive, participle and gerund) have only been considered as

part of a compound verb. That is, tenses formed by an auxiliary verb

(haver ‘to have’ or anar ‘to go’, and ser ‘to be’, for passive forms)

plus a non-finite form of the main verb.

Indicators of lexical growth for a Catalan L1 corpus 171

4. Results

4.1. Quantitative description of the corpus

Table 2 presents the distribution of tokens, types and lemmas by text

type. There are some differences between both the two discourse

genres and the two modalities of production. Expository texts are

significantly longer than narrative texts, both spoken and written: (F

(1, 74) = 19.024, p = .000) for tokens, (F (1, 74) = 22.661, p = .000)

types, (F (1, 74) = 24.078, p = .000) and lemmas, respectively. Spoken

texts in both genres have significantly more tokens, types and lemmas

than written texts: (F (1, 74) = 25.669, p = .000) for tokens, (F (1, 74)

= 12.080, p = .001) types, (F (1, 74) = 12.556, p = .001) and lemmas,

respectively. By text types, spoken expository has more tokens than

spoken narrative and written expository than written narrative. Types

and lemmas present a different pattern: expository texts, both spoken

and written, have more types and lemmas than narrative ones. A

significant interaction between genre and modality is also found for

tokens (F (1, 74) = 11.423, p = .001), types (F (1, 74) = 11.113, p =

.001) and lemmas (F (1, 74) = 8.451, p = .005).

Table 3 presents the distribution of tokens, types and lemmas by

age group. There is a significant effect of age: (F (4, 74) = 13.910, p =

.000) for tokens, (F (4, 74) = 24.176, p = .000) types, (F (4, 74) =

24.047, p = .000) and lemmas, respectively. High school group has the

highest number of tokens, types and lemmas, followed by the older

groups, university and teachers. The next group is secondary school

and elementary school is the age group with the lowest number of

tokens, types and lemmas. So the counts of tokens, types and lemmas

increases gradually from elementary to high school, and then

decreases progressively from high school to university and then from

university to teachers. However, the means of tokens, types and

lemmas show a different pattern: there is an increase from elementary

school to university, but then the means of tokens, types and lemmas

decrease in the teachers’ group. An interaction is found between genre

and age for tokens (F (4, 74) = 4.660, p = .002), types (F (4, 74) =

172 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

3.748, p = .008) and lemmas (F (4, 74) = 3.732, p = .008). The

number of tokens also shows an interaction between modality and age

(F (4, 74) = 2.574, p = .045).

Counts

Spoken

Expository

Tokens

30,315

383.73

406.69

Types

12,269

155.30

107.77

Lemmas

9,479

119.99

79.64

Written

Expository

Tokens

16,929

214.29

143.39

Types

9,828

124.41

65.19

Lemmas

7,791

98.62

51.38

Spoken

Narrative

Tokens

22,151

280.39

275.56

Types

9,730

123.16

77.97

Lemmas

7,658

96.94

59.63

Written

Narrative

Tokens

14,686

185.90

104.56

Types

8,785

111.20

53.98

Lemmas

6,883

87.13

41.98

Table 2. Tokens, types and lemmas distribution by text type.

Counts

Elementary

School

Tokens

11,223

140.28

57.63

Types

6,356

79.45

25.62

Lemmas

5,013

62.66

18.63

Secondary

School

Tokens

12,200

160.52

71.70

Types

6,592

86.73

28.19

Lemmas

5,137

67.59

19.55

High School

Tokens

26,082

326.02

192.00

Types

12,059

150.73

56.26

Lemmas

9,252

115.65

41.66

University

Tokens

19,627

490.67

282.00

Types

8,241

206.02

69.98

Lemmas

6,460

161.50

53.38

Teachers

Tokens

14,949

373.72

75.37

Types

7,364

184.10

30.50

Lemmas

5,949

148.72

25.90

Table 3. Tokens, types and lemmas distribution by age.

Indicators of lexical growth for a Catalan L1 corpus 173

4.2. Lexical diversity

Figure 1 illustrates the results by text type of the three measures

selected for characterising lexical diversity: types per token, lemmas

per token and lemmas per type ratios. Firstly, the results of the type-

token ratio are presented. Regarding the differences between the two

discourse genres, no significant effect of genre is found. As for

modality, type-token ratio is significantly higher in the written than in

the spoken texts (F (1, 74) = 175.278, p = .000). By text types, written

texts, both expository and narrative, have the highest type-token ratio,

followed by spoken narrative and spoken expository. There is an

almost significant interaction between genre and modality (F (1, 74) =

3.568, p = .063). Secondly, the lemma-token ratio results show that,

like in the type-token ratio, genre has no significant effect. The

lemma-token ratio is significantly higher in the written than in the

spoken modality (F (1, 74) = 100.689, p = .000). By text types, the

lemma-token ratio is higher in spoken narrative texts than in spoken

expository, but in written narrative is lower than in written expository.

This ratio shows an interaction between discourse genre and modality

of production (F (1, 74) = 4.349, p = .040). Finally, the results for the

lemma-type ratio are presented. This ratio is not affected by genre or

modality. Spoken narrative texts have the highest lemma-type ratio,

followed by written expository, spoken expository and written

narrative. An almost significant interaction between genre and

modality is found (F (1, 74) = 3.529, p = .064).

Figure 2 shows type-token, lemma-token and lemma-type ratios

by age group. The type-token ratio is significantly affected by age (F

(4, 74) = 10.423, p = .000), it decreases from elementary school to

university but it increases in the teachers’ group. This result indicates

a higher lexical diversity in the oldest age group, so an interaction

both between genre and age (F (4, 74) = 5.349, p = .001) and between

modality and age (F (4, 74) = 4.608, p = .002) is found. Regarding the

lemma-token ratio, it trends similarly: there is a significant effect of

age (F (4, 74) = 7.310, p = .000), there is an interaction between genre

and age (F (4, 74) = 4.070, p = .005) and between modality and age (F

(4, 74) = 4.422, p = .003). Finally, the lemma-type ratio only shows an

almost significant effect of age (F (4, 74) = 2.252, p = .071).

174 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

Figure 1. Type-token-lemma ratios by text type.

Figure 2. Type-token-lemma ratios by age group.

Indicators of lexical growth for a Catalan L1 corpus 175

4.3. Lexical density

Table 4 presents lexical density by text type; roughly, almost half of

the tokens used on each text type are lexical tokens. Expository texts

have significantly higher lexical density than narrative ones, (F (1, 74)

= 30.151, p = .000). No significant effect of modality is found. By text

types, lexical density is higher in spoken expository texts than in

spoken narrative, but in written narrative is higher than in written

expository texts. There is a significant interaction between genre and

modality (F (1, 74) = 4.524, p = .037).

Table 5 presents the results by age group. Lexical density is

significantly affected by age (F (4, 74) = 10.193, p = .000). No

interactions between genre and age or between modality and age are

found. We can observe from the counts of lexical density that, except

for the secondary school group, lexical density shows a developmental

pattern. However, since the mean lexical density of each age group

increases gradually, we run Bonferroni post-hoc comparisons in order

to determine between which age groups significant differences are

found. Regarding discourse genres, significant differences in

expository texts are between the youngest and the oldest groups, that

is, between elementary school (M = .484, SD = .042) and teachers (M

= .527, SD = .032) (d = .043); and also between secondary school (M

= .470, SD = .036) and both university (M = .509, SD = .020) (d =

.039) and teachers (d = .057). For narrative texts, Bonferroni post-hoc

analyses show that significant differences are to be found between

elementary school (M = .431, SD = .044) and the three oldest groups,

that is, high school (M = .469, SD = .029) (d = .038), university (M =

.493, SD = .018) (d = .061) and teachers (M = .488, SD = .021) (d =

.056); and also between secondary school (M = .446, SD = .048) and

both university group (d = .046) and teachers (d = .041). As for

modality, significant differences in spoken texts are found between the

youngest and the oldest groups, that is, between elementary school (M

= .460, SD = .028) and both university group (M = .506, SD = .021) (d

= .045) and teachers (M = .505, SD = .025) (d = .044); and also

between secondary school (M = .456, SD = .038) and both university

group (d = .049) and teachers (d = .048). Finally, for written texts,

significant differences are found between elementary school (M =

176 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

.455, SD = .040) and the two oldest groups, that is, university group

(M = .496, SD = .021) (d = .041) and teachers (M = .509, SD = .025)

(d = .054); and also between secondary school (M = .459, SD = .035)

and teachers (d = .049).

Counts

Spoken

Expository

Tokens

30,315

383.73

406.69

Lexical tokens

15,071

190.92

210.85

Lexical density

.497

.487

.040

Written

Expository

Tokens

16,929

214.29

143.39

Lexical tokens

6,960

107.06

73.19

Lexical density

.411

.495

.044

Spoken

Narrative

Tokens

22,151

280.39

275.56

Lexical tokens

10,527

133.53

138.20

Lexical density

.475

.463

.042

Written

Narrative

Tokens

14,686

185.90

104.56

Lexical tokens

6,897

86.25

52.78

Lexical density

.469

.455

.053

Table 4. Lexical density by text type.

Counts

Elementary

School

Tokens

11,223

140.28

57.63

Lexical tokens

5,182

64.23

27.36

Lexical density

.461

.457

.031

Secondary

School

Tokens

12,200

160.52

71.70

Lexical tokens

4,122

73.96

34.42

Lexical density

.337

.458

.031

High School

Tokens

26,082

326.02

192.00

Lexical tokens

12,627

157.72

97.27

Lexical density

.484

.481

.022

University

Tokens

19,627

490.67

282.00

Lexical tokens

9,902

247.32

142.20

Lexical density

.504

.501

.016

Teachers

Tokens

14,949

373.72

75.37

Lexical tokens

7,590

190.82

41.95

Lexical density

.507

.021

Table 5. Lexical density by age group.

Indicators of lexical growth for a Catalan L1 corpus 177

4.4. Word length

Figure 3 shows the results of word length by text type. Word length

shows significant differences between the genres and modalities

studied. Regarding the differences between genres, expository texts

have a higher word length than narrative texts, there is a significant

effect of genre (F (1, 74) = 19.614, p = .000). As for modality, written

texts have a higher word length than spoken texts, and significant

differences are found (F (1, 74) = 68.432, p = .000). By text type,

written expository texts have the longest words, followed by spoken

expository, written narrative and spoken narrative; there is a significant

interaction between genre and modality (F (1, 74) = 4.478, p = .038).

Figure 4 illustrates the word length results by age group. Word

length is significantly affected by age (F (4, 74) = 3.728, p = .000).

There is also a significant interaction between genre and age (F (4, 74)

= 4.707, p = .002) as well as between modality and age (F (4, 74) =

7.633, p = .000). This measure shows a clear developmental pattern,

because it increases throughout the subjects’ age, presenting a highest

increase between university and teachers groups.

Figure 3. Word length by text type.

178 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

Figure 4. Word length by age group.

4.5. Productivity of verbs

4.5.1. Diversity of verbal lemmas

Table 6 shows the diversity of verbal lemmas by text type. No

significant effect of genre is found. Written texts have higher diversity

of verbal lemmas than spoken ones (F (1, 74) = 114.586, p = .000). By

text type, written expository is the text type with the highest diversity

of verbal lemmas, .658, then spoken expository, with a diversity of

.516, closely followed by written narrative, .508, and finally, spoken

narrative is the text type with the lowest diversity of verbal lemmas,

.463. Table 7 shows the diversity of verbal lemmas by age group.

There is a significant effect of age, (F (4, 74) = 4.117, p = .005). The

results show that diversity increases gradually, except for the

university group. There is a significant interaction between genre and

age (F (4, 74) = 3.659, p = .009) as well as between modality and age

(F (4, 74) = 5.549, p = .001).

Indicators of lexical growth for a Catalan L1 corpus 179

Counts

Spoken

Expository

Verbal lemmas

635

77.61

83.78

Different verbal lemmas

328

28.63

18.64

Diversity

.516

.471

.157

Written

Expository

Verbal lemmas

407

40.59

25.77

Different verbal lemmas

268

22.76

12.24

Diversity

.658

.597

.116

Spoken

Narrative

Verbal lemmas

621

61.09

53.90

Different verbal lemmas

288

24.27

14.65

Diversity

.463

.454

.125

Written

Narrative

Verbal lemmas

584

39.92

20.64

Different verbal lemmas

297

22.16

10.71

Diversity

.508

.575

.115

Table 6. Diversity of verbal lemmas by text type.

Counts

Elementary

School

Verbal lemmas

323

33.02

14.57

Different verbal lemmas

154

16.76

5.87

Diversity

.476

.573

.096

Secondary

School

Verbal lemmas

301

35.14

16.81

Different verbal lemmas

151

17.23

6.41

Diversity

.501

.555

.089

High

School

Verbal lemmas

640

68.35

37.14

Different verbal lemmas

327

27.91

8.67

Diversity

.510

.529

.082

University

Verbal lemmas

597

96.12

54.21

Different verbal lemmas

285

37.92

14.13

Diversity

.477

.529

.035

Teachers

Verbal lemmas

546

67.30

11.05

Different verbal lemmas

293

33.17

6.10

Diversity

.536

.605

.033

Table 7. Diversity of verbal lemmas by age group.

4.5.2. Productive use of inflectional morphology of verbs

The productive use of inflectional morphology by text type presents

no significant differences between discourse genres and modalities of

production. Verbal lemmas with the least productive use of

inflectional morphology, that is, verbs which appear in only one tense,

180 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

represent more than a half of the total in all text types. Spoken

narrative is the text type which shows a more productive use of

inflectional morphology, verbs like anar ‘to go’ and dir ‘to say’ are

used in 11 different tenses. Written expository is the text type which

has less productive use of inflectional morphology, verbs like fer ‘to

do’ and veure ‘to see’, are used in seven different tenses.

The productive use of inflectional morphology by age group

presents no significant differences. More than a half of verbal lemmas

have the least productive use of inflectional morphology, they appear

in only one tense. Elementary school is the age group that shows a less

productive use of inflectional morphology, for example, the verb dir

‘to say’ is used in eight different tenses. Secondary school group has a

similar productive use of inflectional morphology, for instance, the

verb tenir ‘to have got’ is used in nine tenses. The oldest age groups

have more productive use of inflectional morphology, for example,

the verb dir ‘to say’ is used in 11 tenses in all these three groups.

Verbal lemmas which have a higher productive use seem to be

high frequency verbs (e.g. dir ‘to say’, fer ‘to do’, tenir ‘to have got’).

By contrast, verbal lemmas with a lower productive use seem to be

low frequency verbs (e.g. traumatitzar ‘to traumatize’, senyorejar ‘to

dominate’, violentar ‘to embarrass’). We have looked up in the DdF

(Diccionari de Freqüències, ‘Frequency List’) (Rafel 1998) of the

Institut d’Estudis Catalans, which is based on the CTILC (Corpus

Textual Informatitzat de la Llengua Catalana) and includes literary

and non-literary written texts in Catalan, in order to compare the most

frequent verbal lemmas of the GRERLI-CAT1 corpus with the

frequency of the same verbal lemmas in the CTILC corpus. Table 8

shows the 10 most frequent verbal lemmas (from highest to lowest) of

the GRERLI-CAT1 corpus on the first column; the second represents

their absolute frequency, that is, the number of occurrences of tokens

of this verbal lemma related to the total of tokens of the GRERLI-

CAT1 corpus (84,081). The third column shows their relative

frequency, that is, the percentage of the lemma representation to the

total of tokens of the corpus. In the fourth column, the same verbal

lemmas are presented from highest to lowest frequency, according to

the DdF; the fifth column represents their absolute frequency related

to the total of tokens of the CTILC corpus (52,375,044); and, finally,

Indicators of lexical growth for a Catalan L1 corpus 181

the sixth column shows their relative frequency. As we can see, the

results are quite similar, if we take into account the huge difference

between the number of tokens of both corpora. This confirms that

verbs with more productive use are the most frequently used.

Verbal lemma

GRERLI-CAT1

Absolute

frequency

Relative

frequency

Verbal lemma

CTILC

Absolute

frequency

Relative

frequency

dir ‘to say’

3.967

ser ‘to be’

1,000,352

1.951

fer ‘to do’

3.636

fer ‘to do’

338,070

.659

ser ‘to be’

3.058

tenir

‘to have got’

236,795

.462

anar ‘to go’

2.562

dir ‘to say’

222,380

.433

haver ‘to have’

2.149

poder ‘can’

211,688

.413

passar

‘to happen’

1.901

anar ‘to go’

114,340

.223

poder ‘can’

1.901

estar ‘to be’

111,112

.216

tenir

‘to have got’

1.405

passar

‘to happen’

64,499

.125

estar ‘to be’

1.157

començar

‘to start’

31,259

.060

començar

‘to start’

.661

haver ‘to have’

1,984

.003

Table 8. Comparative between the 10 most frequently used verbal lemmas of the

GRERLI-CAT1 corpus and the results of the DdF.

5. Conclusion

Our study enables us to distinguish the specific contribution of four

lexicon-related measures for the corpus-based study of lexical

development. The results show that word length offers the best

diagnosis of lexical development, genre and modality differentiation.

Lexical density is a good indicator of developmental changes and

genre differentiation, but not of modality differentiation. Lexical

diversity, by contrast, is a good indicator of developmental changes

and modality differences, but not of genre differentiation. Similarly,

182 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

diversity of verbal lemmas is also an indicator of developmental

changes and modality differences. Finally, productive use of

inflectional morphology is not appropriate for characterizing later,

school age language development of lexicon.

Word length appears as a valid measure to differentiate between

school level, discourse genres and modalities of production. There are

significant differences between age groups, effects of genre and

modality and also an interaction between genre and age, modality and

age and also between genre and modality. Word length shows a clear

developmental pattern from childhood to adulthood. Moreover, word

length is significantly higher in the expository genre than in the

narrative one, and in written compared to the spoken modality. In line

with other studies, such as Stömqvist et al. (2002), for Swedish

language, or Llauradó / Tolchinsky (in press), for Catalan language,

word length appears as the best diagnosis of development, genre and

modality differences.

Lexical density is a valid measure to differentiate between

school level and discourse genres. Significant differences between age

groups and an effect of genre are found, and there is also an

interaction between genre and modality. Lexical density yielded no

clear developmental pattern, like the ones found in previous studies

(Llauradó / Tolchinsky in press), though we found differences

between the youngest and the oldest groups. As for differences by

genre, expository texts are denser than narrative ones. However, and

unlike other studies in which lexical density showed significant

differences between written and spoken texts (Strömqvist et al. 2002),

we found no significant effect of modality of production.

Lexical diversity, measured by type-token, lemma-token and

lemma-type ratios, is a valid measure for revealing developmental

changes and modality differences. However, lexical diversity does not

account for genre differentiation. In both type-token and lemma-token

ratios, there is a significant effect of age and modality, and

interactions between genre and age, modality and age and also

between genre and modality. Regarding the lemma-type ratio, no

significant differences between genres, modalities and age groups are

found. The highest lexical diversity is found in the oldest age group, in

line with the findings of other studies (Berman / Verhoeven 2002;

Indicators of lexical growth for a Catalan L1 corpus 183

Strömqvist et al. 2002). Written texts have a higher type-token and

lemma-token ratios than spoken texts. This finding also corroborates

those found for type-token ratio for Swedish by Strömqvist et al.

(2002).

Finally, productivity of verbs functions halfway to our

purposes. As for diversity of verbal lemmas, there is a significant

effect of age and modality of production, and interactions between

genre and age and between modality and age are also observed.

Diversity of verbal lemmas increases gradually through age groups,

except for the university group. Written texts have higher diversity

than spoken ones. As for the productive use of inflectional

morphology of verbs, it does not seem as a valid measure to

differentiate between school level, discourse genres and modalities of

production. Nine-year-olds and adults display a similar use in the

production of inflectional morphology of verbs.

In sum, for our purposes, the most suitable measure is word

length, because it can characterise differences in all variables: age,

genre and modality. Lexical density is appropriate to differentiate

between ages and discourse genres, and lexical diversity reveals as

useful measure to differentiate between ages and modalities of

production. Finally, diversity of verbs can serve to differentiate

between ages and modalities, but productive use of inflectional

morphology of verbs is not a valid indicator of text construction

development in different genres and modalities.

The characteristics of the GRERLI-CAT1 corpus will allow us

to explore the type of lexicon used by different age groups, as well as

the syntactic organization of different types of texts, the textual

components of each discourse genre or the effect of the order of text

production (Cutillas in press), among others. Cross-linguistic

comparisons with the other languages of the main project will enable

us to analyse the similarities and differences between these languages,

as well. Moreover, comparisons with the same corpus for Catalan as

L2 (GRERLI-CAT2) will provide us with valid data to characterise

Catalan language development (L1) and language learning (L2).

184 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

Acknowledgements

Data for this paper were gathered within the I+D research project

Hacia el dominio experto de la lengua: estudio comparado del

desarrollo del repertorio lingüístico nativo y no nativo en castellano y

catalán ‘Becoming an expert user of language: a comparative study of

the development of native and non-native linguistic repertoire in

Spanish and Catalan’ funded by the Spanish Ministry of Science and

Innovation, reference code EDU2009-08862.

References

AnCora-CA corpus. <http://clic.ub.edu/corpus/ancora/>

Aparici, Melina 2010. El desarrollo de la conectividad discursiva en

diferentes géneros y modalidades de producción [The

Development of Discursive Connectivity in Distinct Genres and

Modalities of Production] (Unpublished PhD dissertation).

Barcelona: Universitat de Barcelona.

Berman, Ruth A. / Slobin, Dan 1994. Relating Events in Narrative: A

Crosslinguistic Developmental Study. Mahwah: Lawrence

Erlbaum Associates.

Berman, Ruth A. / Verhoeven, Ludo 2002. Cross-linguistic

Perspectives on the Development of Text Production Abilities:

Speech and Writting. Written Language and Literacy. 5/1, 1-44.

Boix, Emili / Alturo, Núria / Perea, Maria P. 2002. Corpus del Català

Contemporani de la Universitat de Barcelona (CUB). A General

Presentation. Romanistische Korpuslinguistik. Korpora und

gesprochene Sprache. Tübingen: Gunter Narr, 155-170.

<http://www.ub.edu/ cccub/>.

CesCa corpus. <http://clic.ub.edu/corpus/corpus/cesca/>

CICA corpus. <http://www.cica.cat/>

Indicators of lexical growth for a Catalan L1 corpus 185

Civit, Montserrat 2003. Criterios de desambiguación morfosintáctica

de corpus en español [Morphosyntactic Disambiguation

Criteria of Corpora in Spanish]. Alicante: Sociedad Española

para el Procesamiento del Lenguaje Natural.

CTILC corpus. <http://ctilc.iec.cat/>

Cutillas, Laia In press. Parlar per escriure o escriure per parlar? [Talk

for Writing or Write for Talking?]. In L. Tolchinsky (ed.) Cap a

una explotació didàctica dels corpus lingüístics [Towards an

Educational Explotiation of Linguistic Corpora]. Barcelona:

Horsori.

EAGLES 1996. Recommendations for the Morphosyntactic

Annotation of Corpora. <http://www.ilc.cnr.it/EAGLES96/

annotate/annotate.html>.

Granger, Sylviane / Wynne, Martin 2000. Optimising Measures of

Lexical Variation in EFL Learner Corpora. In Granger,

Sylviane / Wynne, Martin (eds) Corpora Galore: Analyses and

Techniques in Describing English. Amsterdam: Rodopi, 249-

257.

Llauradó, Anna / Tolchinsky, Liliana In press. The Growth of the

Text-embedded Lexicon in Catalan from Chilhood to

Adolescence. First Language.

Llauradó, Anna / Martí, Maria Antònia / Tolchinsky, Liliana 2012.

Corpus CesCa: Compiling a Corpus of Written Catalan

Produced by School Children. International Journal of Corpus

Linguistics. 17/3, 428-441.

MacWhinney, Brian 2012. The CHAT Transcription Format.

<http://childes.psy.cmu.edu/manuals/chat.pdf>.

Malvern, David / Richards, Brian / Chipere, Ngoni / Durán, Pilar

2004. Lexical Diversity and Language Development.

Quantification and Assessment. Hampshire: Palgrave

MacMillan.

Nesselhauf, Nadja 2004. Learner Corpora and their Potential for

Language Teaching. In J. Sinclair (ed.) How to Use Corpora in

Language Teaching. Amsterdam: John Benjamins, 125-152.

Rafel, Joaquim 1998. Diccionari de freqüències. [Frequency List].

Barcelona: Institut d'Estudis Catalans.

186 Laia Cutillas, Liliana Tolchinsky, Elisa Rosado, Joan Perera

Rafel, Joaquim / Solanellas, Josep M. 1986. El corpus textual

automatizat de la llengua catalana. Actas de las II Jornadas

Españolas de Documentación Automatizada. 147-161.

Read, John 2000. Assessing Vocabulary. Cambridge: Cambridge

University Press.

Serrat, Elisabet / Sanz-Torrent, Mònica / Bel, Aurora 2004.

Aprendizaje léxico y desarrollo de la gramática: Vocabulario

verbal, aceleración morfológica y complejidad sintáctica

[Lexical Learning and Grammar Development: Morphological

Productivity, Syntactic Complexity, and Verb Vocabulary

Learning]. Anuario de Psicología. 35/2, 221-234.

Strömqvist, Sven / Johansson, Victoria / Kriz, Sarah / Ragnarsdóttir,

Hrafnhildur / Aisenman, Ravid / Ravid, Dorit 2002. Toward a

Cross-linguistic Comparison of Lexical Quanta in Speech and

Writing. Writen Language and Literacy. 5/1, 45-67.

Taulé, Mariona / Martí, Maria A. / Recasens, Marta 2008. AnCora:

Multilevel Annotated Corpora for Catalan and Spanish.

Proceedings of 6th International Conference on Language

Resources and Evaluation, 96-101.

Tolchinsky, Liliana 2004. The Nature and Scope of Later Language

Development. In R. Berman (ed.) Language Development

across Childhood and Adolescence. Amsterdam: John

Benjamins, 233-248.

Torruella, Joan 2009. Los ejes principales en el diseño de un corpus

diacrónico: El caso del CICA [The Principal Axes in the Design

of a Diachronic Corpus: the case of CICA]. In Cantos, Pascual /

Sánchez, Aquilino (eds) A Survey on Corpus-based Research /

Panorama de Investigaciones Basadas en Corpus. Asociación

Española de Lingüística del Corpus, 21-36.

Linguistic indicators of text quality in analytical texts: developmental changes and sensitivity to pedagogical work ( Indicadores lingüísticos de la calidad textual en los textos analíticos: cambios evolutivos y sensibilidad al trabajo pedagógico )

Article

Jan 2021

Mastering analytical writing involves the proficient use of varied later-acquired grammatical, lexical and discourse forms and functions. Developmental studies have identified specific linguistic features as diagnostic of increasing proficiency. This study examines how these features change throughout educational levels and before and after the implementation of a set of classroom activities aimed at homogenizing participants’ pedagogical input while raising their awareness of analytical texts’ main functions. Two hundred and twelve Spanish speakers from primary, secondary and university levels participate in the study, each producing two analytical texts. The corpus consists of 424 texts produced on the same topic. Results indicate that not all the features pertaining to the same domain show identical changes across educational levels or are identically sensitive to pedagogical work. Productivity measures increase with educational level and after pedagogical work, except for university students. Most lexical, syntactic and discourse-level measures improve with educational level, though their sensitivity to pedagogical work is not straightforward. Findings suggest the need to evaluate writing quality at a high level of granularity.

Use of adjectives in Catalan: A morphological characterization in different genres and modes of production through school-age development

Preprint

Full-text available

Feb 2017

Adjectives, like nouns and verbs, are one of the three major classes of lexical words. But, unlike nouns and verbs, they emerge late in acquisition. In Catalan, as in many other languages, their use is closely linked to the literate lexicon learned at school-age. Thus, the use of adjectives can be a good indicator of later language development. The goal of this study is twofold: to characterize the use of adjectives from age nine to adulthood and to examine the effect of discourse genre (expository and narrative) and mode of production (spoken and written) on frequency of use and word-internal morphological structure. The study takes a corpus-based approach and uses the GRERLI-CAT1 corpus, which contains 316 expository and narrative spoken and written texts produced by 79 Spanish/Catalan bilinguals whose home language is Catalan, at four age and schooling levels (primary school (9-to 10-year-olds), secondary school (12-to 13-year-olds), sixth form (16-to 18-year-olds) and university (adults)). Results show that the use of adjectives expands through school-age and First Language 0(0) especially from sixth form onwards, presenting an increasing pattern. An effect of genre and mode of production on the target features was also detected. Expository texts contain significantly more adjectives per text and clause and lower-frequency adjectives than narrative texts. Written texts contain significantly more adjectives, and lower-frequency and longer adjectives, than spoken texts. Age interacts with mode of production in the use and morphological complexity of adjectives. The four text types analysed (spoken expository, written expository, spoken narrative and written narrative) present a complexity cline, from written expository texts to spoken narratives through spoken expository texts and written narratives.

L'ús dels adjectius al llarg del desenvolupament tardà del català [Use of adjectives throughout later Catalan development]

Thesis

Full-text available

Sep 2017

Laia Cutillas i Alberich

Esdevenir parlants competents d’una llengua és un dels desenvolupaments més notables del nostre creixement com a persones. En aquesta tesi ens proposem estudiar l’adquisició del català com a primera llengua (L1), fixant-nos en el desenvolupament d’una categoria lèxica – l’adjectiu – al llarg de l’escolaritat i fins a l’edat adulta, en diferents gèneres discursius (expositiu i narratiu) i modalitats de producció (oral i escrita). Els adjectius són un dels recursos descriptius més potents de què disposa la llengua, l’aprenentatge i el perfeccionament del seu ús és un procés llarg que s’estén més enllà de l’adolescència i està íntimament lligat a l’escolaritat i l’alfabetització, per la qual cosa l’ús dels adjectius pot ser un bon indicador del desenvolupament del lèxic. Durant l’última dècada, diversos estudis han analitzat el creixement i la composició de la categoria adjectiu al llarg de l’etapa escolar, principalment en anglès (Marineille & Kneile, 2012; Nippold & Sun, 2008) i hebreu (Ravid & Levie, 2010). Estudis duts a terme en català també han mostrat els adjectius com una de les categories que més creixement experimenta amb l’edat (Cordero, 2002; Gili Gaya, 1972; Llauradó & Tolchinsky, 2013). El present treball és un estudi de l’ús dels adjectius des d’una aproximació corpus- based. Aplicarem aquesta metodologia a l’anàlisi del desenvolupament tardà del català. L’estudi es basa en el corpus GRERLI-CAT1, el qual consta de 316 textos produïts per 79 informants bilingües català/castellà, que tenen el català com a llengua d’ús habitual. Els informants pertanyen a diferents nivells educatius: Primària (9-10 anys), Secundària (12-13 anys), Batxillerat (16-18 anys) i Universitat (majors de 19 anys). Cada informant va produir quatre textos sobre els conflictes escolars: un expositiu oral, un expositiu escrit, un narratiu oral i un narratiu escrit. Compararem els textos, doncs, tenint en compte aquestes variables independents: l’edat/nivell educatiu, el gènere discursiu i la modalitat de producció. Els objectius generals són tres. En primer lloc, volem identificar quines mesures lèxiques serveixen per a diagnosticar el desenvolupament del lèxic i la diferenciació entre gèneres i modalitats. En segon lloc, volem caracteritzar l’ús dels adjectius segons l’edat i el tipus de text. Finalment, volem identificar quines característiques morfològiques i de freqüència d’ús dels adjectius expliquen millor els judicis dels parlants sobre la seva composicionalitat. Per a complir aquests objectius hem dut a terme tres estudis. El primer centrat en analitzar la diversitat i densitat lèxica, la llargada de paraula i la productivitat verbal, i la influència que hi tenen les variables independents del corpus. El segon encaminat a descriure l’ús dels adjectius segons l’edat, el gènere discursiu i la modalitat de producció; analitzant la quantitat, la freqüència i l’estructura morfològica interna dels adjectius. I el tercer dedicat a esbrinar de quina manera els parlants accedeixen al seu coneixement morfològic a l’hora de jutjar la composicionalitat dels adjectius extrets del corpus GRERLI-CAT1. Els resultats del primer estudi indiquen que la llargada de paraula és el millor diagnòstic del desenvolupament lèxic i de la diferenciació entre gèneres i modalitats. La diversitat i la densitat lèxiques també són bons indicadors evolutius. D’aquestes dues mesures, la primera serveix per a diferenciar entre modalitats i la segona és útil per a fer-ho entre gèneres. En canvi, la productivitat verbal no és apropiada per al nostre objectiu. Els resultats del segon estudi mostren que l’ús dels adjectius s’expandeix al llarg de l’escolaritat, donat que augmenta amb el nivell educatiu. També s’ha detectat un efecte del gènere discursiu i de la modalitat de producció. Els quatre tipus de text analitzats presenten una gradació en la seva complexitat, des dels expositius escrits fins als narratius orals, passant pels expositius orals i els narratius escrits. Els resultats del tercer estudi revelen que els parlants jutgen la composicionalitat dels adjectius en termes d’un contínuum i no pas en termes dicotòmics – adjectius simples vs. complexos. Els predictors més forts de la composicionalitat són la llargada de l’adjectiu, el número de morfemes comptats des del punt de vista sincrònic, la presència d’un prefix productiu i que la base sigui més freqüent que l’adjectiu derivat. En resum, l’anàlisi quantitativa del lèxic mostra que, a mesura que avancen pels nivells educatius, els informants utilitzen un lèxic més complex en els textos expositius i en la modalitat escrita. L’examen de l’ús dels adjectius ens ha permès obtenir una visió evolutiva de la riquesa lèxica i de com s’adapta el vocabulari usat a cada tipus de text. L’estudi sobre la composicionalitat dels adjectius indica que la complexitat morfològica afecta als usos lèxics, però que és un coneixement implícit dels parlants. Així doncs, el present estudi proveeix a la comunitat acadèmica d’una descripció acurada de l’ús dels adjectius al llarg del desenvolupament tardà del català. Disponible a: http://www.tdx.cat/handle/10803/458445 -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Becoming proficient speakers of a language is one of the most prominent developments in our growth as people. This thesis is aimed to study Catalan acquisition as first language (L1), focusing on the development of a lexical category – the adjective – throughout school age and until adulthood, in different discourse genres (expository and narrative) and modes of production (spoken and written). Adjectives are one of the most powerful descriptive devices a language has; learning and improving its use is a long process which goes beyond adolescence and it is closely related to schooling and literacy. Therefore, the use of adjectives is a good indicator of lexical development. During the last decade, several studies have analysed the growth and composition of adjective category through the school years, mainly in English (Marineille & Kneile, 2012; Nippold & Sun, 2008) and Hebrew (Ravid & Levie, 2010). Studies which were carried out in Catalan have also shown adjectives as one of the categories which experience a higher increase with age (Cordero, 2002; Gili Gaya, 1972; Llauradó & Tolchinsky, 2013). In this thesis the study of the use of adjectives is tackled from a corpus-based approach. We apply this method to the analysis of later Catalan language development using the GRERLI-CAT1 corpus. The corpus contains 316 texts produced by 79 Spanish/Catalan bilinguals whose home language is Catalan. Subjects were from different schooling levels: Primary School (9- to 10-year-olds), Secondary School (12- to 13-year-olds), Sixth Form (16- to 18-year-olds) and University (more than 19). Every subject produced four texts about school conflicts: spoken expository, written expository, spoken narrative and written narrative. We are going to compare these texts taking into consideration the following independent variables: schooling level, discourse genre and mode of production. The general objectives are three. Firstly, to identify which lexical measures are used in order to diagnose the lexical development and the differences between discourse genres and modes of production. Secondly, to characterize the use of adjectives according to schooling level and text type. Finally, to identify which morphological features and frequency of use characteristics of adjectives explain in a better way the speakers’ judgements about their compositionality. In order to achieve these objectives we have carried out three studies. The first one is focused on the analysis of lexical diversity, lexical density, word length and productivity of verbs, and the influence of the corpus independent variables. The second one is focused on describing the use of adjectives according to schooling level, discourse genre and mode of production, analysing the amount, frequency of use and word-internal morphological structure of adjectives. And the third one is aimed to find out how speakers access to their morphological knowledge when judging the compositionality of the adjectives extracted from the GRERLI-CAT1 corpus. The results of the first study show that word length is the best diagnosis for lexical development, as well as genre and mode differentiation. The lexical diversity and density are also good evolutionary indicators. The first measurement is used to differentiate between modes of production and the second one is useful to differentiate discourse genres. However, productivity of verbs is not appropriate for our objective. The results of the second study show that the use of adjectives expands throughout school years, as it increases with school level. An effect of discourse genre and mode of production has also been detected. The analysed four text types present a complexity cline, from written expository texts to spoken narratives through spoken expository texts and written narratives. The results of the third study reveal that the speakers judge adjectives’ compositionality not in dichotomous terms – simple vs. complex adjectives – but in a continuum. The strongest compositionality predictors are adjective length, the number of morphemes counted from a synchronic point of view, the presence of a productive prefix and the relative frequency, that is, when the base is more frequent than the derived adjective. To sum up, the quantitative analysis of the lexicon shows that, as subjects move on through the different school levels, they use a more complex lexicon in expository texts and in the written mode. The analysis of the adjective usage allowed us to obtain an evolutionary vision of the lexical richness and the way how vocabulary is adapted to each text type. The study about the compositionality of adjectives indicates that the morphological complexity affects lexical usage, although it is an implicit knowledge of the speakers. So the present study provides the academic community with an accurate description of the use of adjectives throughout later Catalan language development. Available at: http://www.tdx.cat/handle/10803/458445

Lexical diversity as a predictor of genre in TV shows

Article

Mar 2023

Many studies have investigated the linguistic characteristics of television and have found important differences between categories of TV programs. Yet, little is known specifically about the lexical profiles of different genres of television discourse. The present study sought to address this gap by exploring the lexical diversity of 714 episodes representing four TV genres. The lexical diversity of each episode was measured using a six-dimensional model of lexical diversity. Multinomial logistic regression was used to determine whether the four TV genres in the present study have unique lexical diversity profiles and whether the genres of individual TV episodes can be predicted based on the adopted model. The results indicated that the four genres do indeed exhibit unique lexical diversity profiles; it was also found that the genres of individual TV episodes can be predicted with approximately 91% accuracy based on this model. These findings were interpreted as underscoring the relevance of lexical diversity to genre analysis of TV shows and the importance of using a theoretically grounded multivariate model of this construct.

Writing Competence and Lexical Diversity in EFL Context of Hungarian and Jordanian University Students

Article

Full-text available

Jan 2024

This quantitative correlational study investigates the relationship between lexical diversity (LD) and writing competence in English as a Foreign Language (EFL) among 66 Hungarian and 64 Jordanian university students, balanced for gender and English proficiency and all enrolled in English studies programs. Participants completed a Bilingual Language Profile (BLP), providing information about their English use, proficiency, and attitudes, and then crafted narratives following a silent film. Lexical diversity, assessed using Text Inspector (an online software), was measured through VocD and MTLD. Writing competence was evaluated using the holistic scoring approach. The study revealed a significant positive correlation between writing competence scores and LD measures VocD and MTLD in both groups, with slightly higher scores for Hungarians. The Hungarian group, predominantly at the C1 level, showed higher proficiency compared to B2-level-dominant Jordanians. Although gender differences in LD and writing competence were observed, they were not statistically significant. This research highlights the practical implications for educators, emphasizing the correlation between writing competence and Vocabulary knowledge in an EFL context.

Individual lexical breadth and its associated measures. A contribution to the calculation of individual lexical richness

Chapter

Full-text available

Sep 2023

In this chapter, we present an original model for the measurement of“indi-vidual lexical breadth”. With this construct, we try to represent lexical richnessbased on a solid and non-intuitive foundation, leading to the understanding thata speaker’s lexical breadth is the quantitative reflection of his or her degree oflexical richness. Thus, lexical breadth is a statistically proven quantitative mea-sure of lexical richness. The model presented here is based on both the size andspecificity of the vocabulary produced by a speaker during a lexical test, i.e., ontheir available lexicon. A principal difference between our measure of individuallexical breadth and others relating to lexical richness is that our measure reflectsa characteristic of the speakers and not of the written production they produce.We propose a number of indexes based on our model, notably, the Index of Lexi-cal Decentralization (ILD) and the Index of Lexical Amplitude (ILA). Our intentionis for these indexes to be used in addition to the four above-mentioned parame-ters (i.e., content density, lexical sophistication, lexical accuracy and lexical diver-sity) in the calculation of lexical richness.Our model involves a revised interpretation of the traditional concept of“lex-ical availability”, while still using association tests as source of data. It is based onprevious work in the fields of socio- and psycholinguistic research (see Ávila-Muñoz 2016) in which association tests proved useful, despite the simplicity withwhich they allowed linguistic information to be obtained (López-Morales 1999;Hernández, Izura, and Ellis 2006; Ávila-Muñoz and Villena-Posoda 2010: 177–279)

A multidimensional perspective on written language development ( Una perspectiva multidimensional del desarrollo del lenguaje escrito )

Article

Dec 2020

The development of writing is pivotal in the educational and communicative growth from childhood to adulthood. This special issue harnesses studies that examine the development of text writing in different discourse genres and languages, showcasing a set of text-embedded features that function as the blueprint for children’s evolving writing ability. These studies focus on both word-level and text-level features; a range of discourse genres; different language domains marked as indicators of writing quality; explanatory factors of writing quality; and a crosslinguistic scope encompassing language typology, the developmental span of writing and writing difficulties. The multidimensional perspective taken in this themed issue addresses the particular and universal aspects underlying the development of writing.

The nature and scope of later language development

Chapter

Full-text available

Jan 2004

Liliana Tolchinsky

Optimising Measures of Lexical Variation in EFL Learner Corpora

Chapter

Full-text available

Jan 1999

Parlar per escriure o escriure per parlar? [Speaking for writing or writing for speaking?]

Chapter

Full-text available

Nov 2013

Laia Cutillas i Alberich

A l’escola, una de les tasques més habituals a l’hora de parlar d’un tema és escriure una redacció o presentar una exposició oral. Tot i això, aquestes dues activitats gairebé sempre són exclusives: o es fa oralment o es fa per escrit, però no és habitual fer la mateixa activitat en les dues modalitats de producció; sembla com si posar per escrit una exposició que ja ha estat presentada oralment o exposar oralment un text escrit no aportés res de nou a l’aprenentatge dels nens. És per això que en aquest treball volem estudiar les diferències observables entre els textos que s’han fet primer de manera oral i després per escrit, i a la inversa; per a veure de quina manera influeix l’ordre de la modalitat de producció en la qualitat dels textos resultants: quins són més llargs? Quins tenen un lèxic més ric i variat? En definitiva, volem esbrinar quins textos són millors, els que s’han fet primer oralment i després per escrit, o els que s’han produït primer per escrit i després s’han exposat oralment. El corpus utilitzat per al present estudi està format per textos produïts per parlants de català com a primera llengua que realitzen quatre tipus de textos en diferents ordres de realització, segons el gènere discursiu i la modalitat de producció: expositiu oral, expositiu escrit, narratiu oral i narratiu escrit. Les característiques que estudiem són la diversitat lèxica, que és el percentatge de paraules diferents sobre el total de paraules d’un text; la llargada mitjana de paraula, mesura que permet veure si les paraules usades són morfològicament simples o complexes, i la densitat lèxica, és a dir, la proporció entre les paraules lèxiques (noms, adjectius, verbs...) i les funcionals (determinants, preposicions, pronoms...). Un cop comparades aquestes característiques dels textos, discutim quin ordre de modalitat de producció influeix positivament en els textos resultants, per així donar pistes als docents sobre com els nens poden millorar les seves habilitats discursives. [One of the most common tasks when talking about a topic at school is to write an essay or to make an oral presentation. However, these two activities are almost always exclusive: either they are done as a speaking task or they are done as a writing task. It is not usual to do the same activity in the two modalities of production, it seems as if to write a presentation that has already been presented orally or to present orally a written text brings nothing new to children’s learning. That is the reason why in this paper we study the observable differences between the texts that have been done first orally and then have been written, and vice versa, to see how the modality of production order may affect the quality of the resulting texts: Which are the longest ones? Which have a richer and more varied vocabulary? In sum, we want to find out which texts are better: the ones that have been done first orally and then have been written, or those that have been done first in writing and then have been orally exposed. The corpus used for this study consists of a set of texts produced by bilingual Spanish/Catalan speakers that have Catalan as their home language. They produced four types of texts in different production orders, according to text genre and modality of production: spoken expository, written expository, spoken narrative, and written narrative. The characteristics studied are lexical diversity, which is the percentage of different words relative to the total words in a text; word length, that can show if the words used are morphologically simple or complex; and lexical density , i.e. the proportion of lexical words (names, adjectives, verbs...) and functional words (determiners, prepositions, pronouns...). Once these texts characteristics have been compared, we discuss which production order has a positive impact on the resulting texts, so as to give clues to teachers about how children can improve their discursive skills.]

Corpus CesCa Compiling a corpus of written Catalan produced by school children

Article

Full-text available

Dec 2012

This paper outlines the compilation of a corpus of Catalan written production. The CesCa corpus presents a picture of the Catalan written language throughout compulsory schooling. It contains two kinds of data: Vocabularies of five semantic fields comprising 242,404 lexical forms and Textual data of four different discourse genres consisting of 207,028 tokens. Both vocabularies and the textual data have been morphologically analyzed and lemmatized. The corpus is freely available. This paper will outline the main features of the corpus and make some suggestions as to the uses to which the corpus can be put.

Toward a cross-linguistic comparison of lexical quantain speech and writing

Article

Full-text available

Mar 2002

The present study presents contrastive analyses of task-oriented spoken and written discourse in terms of lexical diversity, lexical density, and word length. In an age-matched within-language comparison (Swedish), written discourse consistently scored higher on these measures. It is suggested that the same type of differences will hold for any language, because of the difference between speech and writing in processing constraints. The absolute scores, however, can vary substantially for reasons of language typology. An extended, cross-linguistic analysis (English, Hebrew, Icelandic, Swedish), focusing on word length, was made to substantiate that claim. Further, cross-age-group comparisons of lexical quanta indicated a dynamic interaction between speech and writing in development. Spoken discourse eventually comes to “learn from the development of writing.

Cross-Linguistic Perspectives on the Development of Text-Production Abilities: Speech and Writing

Article

Full-text available

Mar 2002

The studies reported in this volume of WL&L (5: 1–2, 2002) all derive from a joint project entitled “Developing literacy in different contexts and in different languages”, funded by the Spencer Foundation, Chicago. The study encompasses seven languages — Dutch, English, French, Hebrew, Icelandic, Spanish, and Swedish — for which data were collected in Europe, Israel, and the US by graduate research assistants in education, linguistics, and psychology, under the supervision of a project director in each country — each of whom is listed as a first or “lead” author in the articles which follow the introduction to this collection.

Growth of text-embedded lexicon in Catalan: From childhood to adolescence

Article

Nov 2013

Lexical development is a key facet of later language development. To characterize the linguistic knowledge of school age children, performance in the written modality must also be considered. This study tracks the growth of written text-embedded lexicon in Catalan-speaking children and adolescents. Participants (N = 2161), aged from 5 to 16 years produced six different texts: a film explanation, a film recommendation, a joke telling and definitions of a noun, a verb and an adjective. The resultant corpus of 11,332 texts was analyzed using four distributional measures of lexical development: word length, lexical density, use of adjectives and nominalizations. Heylighen’s F-measure of level of text formality was also computed. Word length, use of adjectives and nominalizations were powerful indicators of lexical development. Text type and home language had an effect on these measures. Lexical density showed no clear developmental change, and did not vary by type of text. Heylighen’s F-measure was a weaker developmental indicator. Educational implications are discussed.

Relating Events In Narrative: A Cross Linguistic Developmental Study

Article

Dec 1995

EAGLES recommendations for the morphosyntactic annotation of corpora

Article

Learner Corpora and their Potential for Language Teaching

Article

Jan 2004

Nadja Nesselhauf

Indicators of lexical growth throughout age, genre and modality for a Catalan L1 corpus

Abstract and Figures

Recommended publications

Perceived and Actual Linguistic Competence: A Descriptive Study of Four Low-Achieving Hispanic Bilin...

An examination of language input and vocabulary development of young Latino dual language learners l...

Cross-Linguistic Lexical, Grammatical, and Discourse Performance on Oral Narrative Retells Among You...

Preparing Latina/o Bilingual Teachers to Teach Content in Spanish to Emergent Bilingual Students on...