5
PtTenTen: A Corpus for Portuguese
Lexicography
Adam Kilgarriff, Miloš Jakubíček, Jan Pomikalek, Tony Berber Sardinha
and Pete Whitelock
1. Introduction
There are a number of ways in which corpus technology can support lexicography,
as described in Rundell and Kilgarriff (2011). It can make it more accurate, more
consistent and faster. But how might those potential benefits pan out in an actual
project? If starting from a blank sheet of paper, how should one proceed?
In this paper we describe such an exercise. Oxford University Press is preparing
the Oxford Portuguese Dictionary, a new Portuguese–English, English–Portuguese
dictionary. It will cover both Brazilian and European Portuguese, with differences of
words, spelling and usages noted. Each side will contain around 40,000 headwords
and 200,000 meanings. The work here concerns the new analysis of Portuguese for the
Portuguese-source side.
The components of the process are:
Collect the corpus
Process it with the best available tools for the language
From parser output to corpus system
First, we present the end point of the process: high quality word sketches for
Portuguese within the Sketch Engine corpus query tool. Then in the next three
sections we describe the process of getting there. Last, we present an analysis of the
contrast between Brazilian and European Portuguese.
2. Word sketches and the Sketch Engine
Word sketches are one-page automatic, corpus-based summaries of a word’s
grammatical and collocational behaviour. Their value for lexicographic work in
English and other languages, as well as the background of the use of corpora in
lexicography, have been described elsewhere (Kilgarriff et al., 2004).
First, we introduce corpus query systems and the basic idea of word sketches. Next,
we present word sketches for Portuguese.
9781441190505_txt_print.indd 111 02/12/2013 12:03
112 Working with Portuguese Corpora
2.1 Corpus Query Systems
A variety of corpus query systems (CQSs) have been used to examine corpus evidence
since the rise of the first electronic corpora. Starting with the ground-breaking
COBUILD project, lexicographers have been using KWIC (Key Word In Context)
concordances as their primary tool for finding out how a word behaves. Later, with
the growth of corpora, lexical statistics had to be applied to manage the abundant data
and highlight the most salient combinations and collocations. Today, state-of-the-art
CQSs allow the lexicographer great flexibility in searching for phrases, collocates,
grammatical patterns, sorting concordances according to a wide range of criteria,
selecting subcorpora for searching in, say, only spoken text, academic text, or only
fiction. Available systems include WordSmith (Scott, 2008), MonoConc (Barlow,
2000), the Stuttgart Corpus Workbench (CWB, Christ and Schulze, 1994) and Davies’s
SQL architecture (see Davies, this volume).
2.2 The Sketch Engine
The Sketch Engine is a corpus query system which gives access to the familiar CQS
functions: concordances for several types of queries (simple, lemma, phrase, word
form and CQL), with an integrated context-control filter.
The interface includes a variety of viewing and sorting options.
Figure 5.1 Screenshot of concordance interface
However, the features of the Sketch Engine which are of special interest in this paper
are not part of standard concordancing programs. These features include word
sketches, sketch differences and a thesaurus. They are all fully integrated with standard
concordancing.
2.3 Word sketches
To identify a word’s grammatical and collocational behaviour, the Sketch Engine
needs to know how to identify words connected by a grammatical relation. This can
be achieved in one of two ways.
The first possibility is to use a ‘sketch grammar’: the input corpus is loaded into the
Sketch Engine, part-of-speech-tagged but not parsed, and the Sketch Engine supports
the process of identifying grammatical relation instances through a grammar written
as regular expressions over part-of-speech tags.
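The idea of a grammar written as regular expressions over part-of-speech tags can be illustrated with a minimal sketch. The tagset (DET, N, ADJ, V), the rule and the relation name `modifier` are all assumptions made for this example; this is not the Sketch Engine's own sketch grammar formalism, only a toy version of the same principle.

```python
import re

# Toy tagged sentence: (word, lemma, tag). The tagset is invented for illustration.
SENTENCE = [
    ("o", "o", "DET"),
    ("pulso", "pulso", "N"),
    ("firme", "firme", "ADJ"),
    ("governa", "governar", "V"),
]

# One toy rule over the tag sequence: a noun immediately followed by an adjective.
# Named groups mark the two words that enter the relation; \b keeps the pattern
# from matching inside longer tag names.
NOUN_ADJ = re.compile(r"\b(?P<head>N) (?P<dep>ADJ)\b")


def find_relations(sentence, rule, name):
    """Return (relation, head lemma, dependent lemma) triples matched by rule."""
    tags = [tag for _, _, tag in sentence]
    # Character offset at which each tag starts in the space-joined tag string.
    starts, pos = [], 0
    for tag in tags:
        starts.append(pos)
        pos += len(tag) + 1
    tag_string = " ".join(tags)
    triples = []
    for match in rule.finditer(tag_string):
        head = starts.index(match.start("head"))
        dep = starts.index(match.start("dep"))
        triples.append((name, sentence[head][1], sentence[dep][1]))
    return triples


print(find_relations(SENTENCE, NOUN_ADJ, "modifier"))
```

Counting such triples over a whole corpus, per lemma and per relation, is what would feed the tables of a word sketch.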
In the second approach, we parse the input corpus, so that the information about which
word-instances stand in which grammatical relations with which other word-instances is
embedded in the corpus. Currently, dependency-based syntactically annotated corpora
are supported. Phrase-structured trees need heads of phrases to be marked.
For most languages where word sketches have been built, the first method was
used. The work described in this paper is the first large-scale use of parser output to
create word sketches.
Figure 5.2 Screenshot of concordance viewing and sorting options
One of the hardest parts of the lexicographer’s task is not to miss anything. In
Figure 5.3, we see a Word Sketch for the lemma pulso, which has a frequency of 22,328
in the corpus. An inspection of the Sketch shows five basic senses. The first sense
refers to the joint that connects the hand to the arm (wrist), indicated by collocates
such as relógio (watch), fratura (fracture) and esquerdo (left). The second one (pulse)
is medical and means the beat that results from the passing of blood through the
arteries and veins – it is revealed by collocates such as femoral (femoral), auscultar
(auscultate) and basal (basal). The third one (pulse), related to telephony, is revealed
by words such as telefônico (telephone), tarifação (pricing) and tom (tone). The fourth
one is specific to electricity and collocates with tensão (tension), corrente (current) and
eletromagnético (electromagnetic). The last sense is figurative and denotes a measure
of strength (hand), forming collocations with firme (firm), firmeza (firmness) and
a number of verbs such as governar (govern), comandar (command), agir (act) and
administrar (manage). An inspection of these collocations will show the pervasiveness
of the idiom ‘com pulso firme’ (with a firm/strong hand).
Figure 5.3 Partial screenshot of Word Sketch for lemma pulso
Word Sketches can also remind lexicographers to include less obvious meanings and
idioms. The Sketch for the lemma pendurar (hang; frequency of 22,793) includes
both of these. For instance, the idiom pendurar o beiço (literally to drop one’s
lower lip) is a less common way of saying fazer bico (make a grimace). Pendurar a
chuteira (literally to hang one’s soccer spikes) means to end one’s career. This same
meaning is conveyed by pendurar o paletó (literally to hang one’s jacket), which has
yet another meaning: to pretend to work (quem não trabalha, pendura o paletó –
literally those who don’t work, just hang their jackets). Another sense of pendurar
is that of doing a seemingly endless task as in pendurado ao telefone (hanging on
the telephone).
3. Corpus collection
Corpora for lexicography should be large and diverse. If they are, they will provide
evidence about anything that should be in the dictionary. If they are not, they will miss
things. Our experience with English shows that, in order to get a full account for each
of 40,000 words of a language – even the least frequent of them – we need a corpus of
at least a billion words.
Where might a corpus of that size, covering a very wide range of text types, be
found? The answer is the web. There is now substantial evidence that web corpora,
created through the same process of web crawling that the search engines use, offer
diverse and very large corpora which compare well with designed collections (Baroni
et al., 2009; Sharoff, 2006). Informal and speech-like genres tend to be better represented
in web corpora than in many curated corpora, since they contain material from
blogs and similar, while curated corpora in the order of a billion words are likely to
include high proportions of journalism, the easiest text type to obtain in bulk. While
there is no easy answer to the question ‘what text types, and in what proportions, do
we get in a web corpus’, we show below that they provide good lexicographic resources.
The Portuguese corpus described here is one of the ‘TenTen family’ of corpora
(Jakubíček et al., 2013).
3.1 Crawling
The Portuguese corpus was gathered in two parts, the first for European (crawling only
in the .pt domain), the second for Brazilian (.br domain). Following Baroni et al., we
used the Heritrix crawler (http://crawler.archive.org/) and set it up to download only
documents of mime type text/html and between 5 and 200KB in size. The rationale
of mime type restriction is to avoid technical difficulties with converting non-HTML
documents to plain text. The size limit weeds out documents that are too small, which
typically contain almost no text, and very large documents, which are very likely to be
lists of various sorts. Table 5.1 summarizes the sizes of the downloaded data as well as
the time required for crawling.
Figure 5.4 Partial screenshot of Word Sketch for lemma pendurar
3.2 Junk
We do not want our Portuguese corpus to contain material that is not Portuguese text.
We do not want it to contain navigation bars, banner advertisements, menus, formatting
declarations, JavaScript, html, or material in languages other than Portuguese. It is also
important that we represent all texts in a single character encoding (preferably UTF-8)
in order to prevent incorrect character display.
Detecting the original character encoding of each document is our first step, for which
we use the chared tool.1 Once we know what the original encoding is, converting it to
UTF-8 is straightforward.
Next, we remove junk (navigation links, advertisements, etc.) with jusText.2 We run
it with the inbuilt Portuguese model and with the default settings.
In order to preserve only texts in Portuguese, we apply the Trigram Python class for
language detection using character trigrams.3 We train a Portuguese language model
from a 150,000 word text sample taken from Wikipedia and discard all documents
for which the similarity score with the language model is below 0.4. This threshold is
based on the results of our previous experiments.
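The filtering step can be sketched as follows. This is not the Trigram class mentioned above but a minimal stand-in written for illustration: it assumes that similarity is the cosine between character-trigram frequency profiles, and the function names are hypothetical. Only the 0.4 threshold comes from the text.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count character trigrams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def similarity(profile_a, profile_b):
    """Cosine similarity between two trigram profiles (0.0 if either is empty)."""
    dot = sum(count * profile_b[tri] for tri, count in profile_a.items())
    norm_a = math.sqrt(sum(c * c for c in profile_a.values()))
    norm_b = math.sqrt(sum(c * c for c in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_document(document, model, threshold=0.4):
    """Keep a document only if it is similar enough to the language model."""
    return similarity(trigram_profile(document), model) >= threshold


# In practice the model would be trained on a large sample; a short sentence
# stands in for it here.
model = trigram_profile("o rato roeu a roupa do rei de roma")
print(keep_document("o rei de roma roeu a roupa", model))
```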
The first manual examination of the corpus data revealed a substantial amount
of English text despite the applied language filtering. It turned out that there are
numerous documents in the corpus which contain half-Portuguese, half-English
paragraphs and score slightly above the language filtering threshold. To fix this
problem, we applied further anti-English filtering. We compiled a list of the 500
most frequent words of English and removed from the corpus all paragraphs longer
than 50 words where the frequent English words accounted for over 10 percent of
the words.
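The anti-English rule is simple enough to state as a short function. The sketch below assumes whitespace tokenization and basic punctuation stripping, neither of which is specified in the text; the thresholds (50 words, 10 percent) are the ones given above, and the word list passed in would be the 500 most frequent English words.

```python
def is_mostly_english(paragraph, english_words, min_length=50, max_share=0.10):
    """Flag paragraphs longer than min_length words in which frequent English
    words account for more than max_share of all words."""
    words = paragraph.lower().split()
    if len(words) <= min_length:
        return False
    hits = sum(1 for w in words if w.strip('.,;:!?"()') in english_words)
    return hits / len(words) > max_share


def filter_paragraphs(paragraphs, english_words):
    """Drop the paragraphs flagged by is_mostly_english."""
    return [p for p in paragraphs if not is_mostly_english(p, english_words)]
```

Short paragraphs are never removed by this rule, which keeps brief quotations or titles in English from triggering the filter.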
3.3 Duplicates
Duplicates (and, worse still, many-times-replicated material) are bad both because the
lexicographer wastes time passing over concordance lines they have already seen and
because they distort and invalidate statistics.
A central question regarding duplication is ‘at what level’? Do we want to remove
all duplicate sentences, or all duplicate documents?
Table 5.1 Web crawling stats
                      | European Portuguese   | Brazilian Portuguese
HTML data downloaded  | 1.10 TB               | 1.37 TB
Unique URLs           | 31.5 million          | 39.1 million
Crawling time         | 8 days (1–8 Mar 2011) | 10 days (1–10 Jun 2011)
For lexicographic work and other research at the level of lexis and syntax, the
sentence is too small a unit, because if we remove all but one copy of a short sentence
such as ‘Yes it is’ or ‘Who’s there?’ the remaining text will lose coherence and be hard
to interpret. The whole document is too large a unit because we do not want to include
long sections of text twice over where one appeared in document X and the other in
document Y, and the other parts of document X did not duplicate the other parts of
document Y.
The appropriate unit is the paragraph. We identify paragraphs, and then take
additional steps to handle short paragraphs (including dialogue turns like ‘Yes it is’),
only removing them if their context is also duplicate material.
A naïve approach to de-duplication results in a process that gets slower per million
words, the larger the corpus (since there are more already-seen paragraphs to compare
a new paragraph with). The running time of our approach increases only linearly with
the size of the corpus.
We de-duplicate after cleaning, since this reduces the bulk of material to de-duplicate.
The de-duplication process was applied separately for the European and Brazilian
parts. It took 4 hours and 5 hours respectively on a single Intel Xeon 2.13GHz CPU
and removed 75 percent and 68 percent of the cleaned material that we had gathered,
leaving 804 million tokens of European Portuguese and 3.19 billion of Brazilian.
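A minimal sketch of paragraph-level de-duplication with roughly linear running time is given below. It is not the tool used for the corpus: it assumes exact-match duplicates detected through a hash set (so each paragraph costs one lookup regardless of corpus size), and the ten-word cut-off for "short" paragraphs and the neighbour-based definition of context are assumptions made for illustration.

```python
import hashlib


def deduplicate(documents, short=10):
    """documents: list of documents, each a list of paragraph strings.
    Returns the documents with repeated paragraphs removed."""
    seen = set()
    result = []
    for paragraphs in documents:
        # First pass: was each paragraph's normalized text seen before?
        is_dup = []
        for p in paragraphs:
            digest = hashlib.sha1(" ".join(p.split()).encode("utf-8")).digest()
            is_dup.append(digest in seen)
            seen.add(digest)
        # Second pass: drop long duplicates; drop short duplicates only when
        # the neighbouring paragraphs are duplicates as well.
        kept = []
        for i, p in enumerate(paragraphs):
            if not is_dup[i]:
                kept.append(p)
            elif len(p.split()) < short:
                neighbours = is_dup[max(i - 1, 0):i] + is_dup[i + 1:i + 2]
                if not all(neighbours):
                    kept.append(p)
        result.append(kept)
    return result
```

Because the set lookup takes constant time on average, the cost per paragraph does not grow with the number of paragraphs already seen, which is the property the text contrasts with the naïve approach.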
4. Language technology tools for processing Portuguese
The prospects for getting the computer to help the lexicographer are improved if the
text is lemmatized, part-of-speech-tagged and parsed. The lexicographer can then ask
queries about lemmas, word classes and grammatical relations (‘what nouns often
occur as objects of this verb?’) as well as about word forms and positions (‘what words
often come between two and five words after this word?’). We shall be able to provide
better reports to the lexicographer.
We investigated past research on the computational processing of Portuguese
(e.g., Santos et al., 2008) and established that the leading system was PALAVRAS
(Bick, 2000; see Bick’s chapter in this volume). Further investigation revealed that
PALAVRAS development has been ongoing for over ten years and did not reveal any
newcomers that looked better. We concluded that it was probably, in 2011, the most
accurate software for processing Portuguese. We contacted the author and negotiated
a license.
Parsing tends to be a slow process. One concern of ours was that parsing a 2 billion
word corpus would take months or even years.
We parallelized the processing by splitting the corpus into 12 parts and parsing all of
them at the same time on a double 12-core AMD Opteron 800 MHz server. We experienced
technical problems with the parser and had to re-start several times with software
bug fixes and updates obtained from the developers upon our error reports. Despite good
technical support, we were unable to parse the whole data set in a single run without the
process dying. In the end, we split the data into many files of around 10MB and ran a
fresh instance of PALAVRAS for each file. In the final run, using 12 concurrently running
instances of the parser, the processing of the whole data set took 15 days.
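The final set-up (many small chunks, a fresh parser instance per chunk, twelve instances at a time) can be sketched as below. The actual command line of the parser is not given in the text, so `parse_chunk` is a hypothetical stand-in that merely uppercases its input; in a real pipeline it would launch one external parser process per chunk, so that a crash loses at most that chunk.

```python
from concurrent.futures import ThreadPoolExecutor

TEN_MB = 10 * 1024 * 1024


def split_into_chunks(documents, max_bytes=TEN_MB):
    """Group documents into chunks of at most roughly max_bytes of UTF-8 text."""
    chunk, size = [], 0
    for doc in documents:
        doc_size = len(doc.encode("utf-8"))
        if chunk and size + doc_size > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(doc)
        size += doc_size
    if chunk:
        yield chunk


def parse_chunk(chunk):
    """Stand-in for running a fresh parser instance on one chunk (assumption)."""
    return [doc.upper() for doc in chunk]


def parse_all(documents, workers=12, max_bytes=TEN_MB):
    """Process all chunks with a fixed number of concurrent workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(parse_chunk, split_into_chunks(documents, max_bytes))
        return [parsed for chunk in results for parsed in chunk]
```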
The parser crashed on most of the input files. Nevertheless, in most cases it
managed to process a significant part of the input first. A substantial part of the corpus
data was lost during parsing. The final size of the corpus is 773 million tokens for the
European part and 1.2 billion tokens for the Brazilian.
5. Into the Sketch Engine
PALAVRAS is a dependency parser. In dependency grammar, the structure of a
sentence is identified via a set of labelled dependency links from each word to its
governor. For each word in a sentence, PALAVRAS output provides the lemma, the
part-of-speech tag, the name of the grammatical relation in which it stands to its
governor, and a pointer to its governor.
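The four pieces of information per word can be pictured as a small record, from which (relation, governor, dependent) triples follow directly. The field names and the relation labels below (`subject`, `predicative`, `determiner`, `root`) are invented for this sketch and are not PALAVRAS's actual output format or tagset; the example phrase is the one discussed later in this section.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int      # 1-based position in the sentence
    word: str       # surface form
    lemma: str
    pos: str        # part-of-speech tag (toy tagset)
    relation: str   # relation to the governor (toy labels)
    head: int       # index of the governor, 0 for the root


# "é viável sua aplicação" (its application is viable), hand-annotated here.
SENTENCE = [
    Token(1, "é", "ser", "V", "root", 0),
    Token(2, "viável", "viável", "ADJ", "predicative", 1),
    Token(3, "sua", "seu", "DET", "determiner", 4),
    Token(4, "aplicação", "aplicação", "N", "subject", 1),
]


def triples(sentence):
    """Turn per-token governor pointers into (relation, governor lemma,
    dependent lemma) triples, skipping the root."""
    by_index = {t.index: t for t in sentence}
    return [(t.relation, by_index[t.head].lemma, t.lemma)
            for t in sentence if t.head != 0]


print(triples(SENTENCE))
```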
Although the dependency relations computed by PALAVRAS are eminently
suitable for the generation of word sketches, there are many minor ways in which
PALAVRAS output is incompatible with or insufficient for the demands of a practical
lexicographic tool. Thus an extensive post-processing phase takes place to adjust
PALAVRAS output and enrich it in a variety of ways.
In order to explicitly represent a variety of dependencies, PALAVRAS deconstructs
items such as preposition–article contractions and verbs with infixed pronominal
objects. For instance, the contraction dos (of the; plural) becomes two separate words
(de os) with distinct dependencies, while the verb form levá-lo-á (will lead you)
becomes two separate words (levará, o). It was necessary to reconstruct the surface
forms lost by PALAVRAS in order that the lexicographer can extract illustrative
examples from the corpus with minimal difficulty.
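For the contraction case, the reconstruction can be sketched as a lookup over adjacent tokens. This is only an illustration of the idea, not the post-processing actually used: the table lists a handful of standard preposition–article contractions, and the sketch assumes that every adjacent pair in the table was originally contracted, which does not always hold in real text (a real system would rely on information recorded when the parser split the form).

```python
# A few standard Portuguese preposition + article contractions.
CONTRACTIONS = {
    ("de", "o"): "do", ("de", "a"): "da",
    ("de", "os"): "dos", ("de", "as"): "das",
    ("em", "o"): "no", ("em", "a"): "na",
    ("a", "o"): "ao", ("a", "os"): "aos",
}


def restore_surface(tokens):
    """Merge adjacent preposition + article pairs back into contracted forms."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in CONTRACTIONS:
            out.append(CONTRACTIONS[pair])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


print(restore_surface(["direitos", "de", "os", "cidadãos"]))
```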
PALAVRAS also treats a wide variety of multi-word units (e.g., compound nouns
such as direitos humanos, as well as many others) as single items in the dependency
structure. Untreated, this would have the unfortunate effect of omitting the component
words from each other’s word sketches. A simple parser was developed to establish
the internal dependency structure and headedness of such units and the result was
plugged back into the larger structure with the correct dependencies.
In providing each word with a single governor, PALAVRAS does not explicitly
capture relations of importance for complete word sketches. For instance, in the
phrase é viável sua aplicação (its application is viable), a subject relation is established
between aplicação and ser (é). Post-processing adds in the controlled subject relation
between aplicação and viável, information which may be important in the sketch for
these two lemmas. In general, a noun phrase subject will get a subject relation to each
verb or adjective in an auxiliary sequence (or an object relation if the verb is passive).
Another type of relation that is added is the trinary relation corresponding to a
prepositional phrase and its attachment site. PALAVRAS generates binary relations
between the preposition and its governor, and between the preposition and its object.
Post-processing adds in the composition of these two, so that each full lexical item will
appear on the sketch for the other, in a table headed by the preposition.
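The composition of the two binary relations into a trinary one can be sketched as a simple join. The relation names `prep_attachment` and `prep_object` are hypothetical labels chosen for this example, and tokens are identified by their word forms for brevity; a real implementation would key on token positions so that repeated words in a sentence are kept apart.

```python
def compose_prepositions(relations, attach="prep_attachment", obj="prep_object"):
    """relations: (relation, governor, dependent) triples for one sentence.
    Returns (preposition, attachment site, prepositional object) triples."""
    # Index the object(s) of each preposition.
    objects = {}
    for rel, gov, dep in relations:
        if rel == obj:
            objects.setdefault(gov, []).append(dep)
    # Join each attachment link with the objects of the same preposition.
    composed = []
    for rel, gov, dep in relations:
        if rel == attach:
            for noun in objects.get(dep, []):
                composed.append((dep, gov, noun))
    return composed


# "relógio de pulso" (wrist watch): relógio -> de -> pulso
relations = [
    ("prep_attachment", "relógio", "de"),
    ("prep_object", "de", "pulso"),
]
print(compose_prepositions(relations))
```

With the composed triple, pulso can be listed on the sketch for relógio (and vice versa) in a table headed by de.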
A similar treatment is followed for coordination. It is often useful for the lexicographer
to see the words with which a headword occurs often, for example arroz e feijão
(rice and beans). In dependency grammar, the two conjuncts do not stand in a relation
to each other so we post-processed to create a relation between the heads of the two
conjuncts, so that once again they appear on each other’s sketches.
As well as augmenting the relations correctly computed by PALAVRAS with
various others, it is desirable to correct some of the decisions made by the parser.
Betraying its lack of statistical processing, PALAVRAS often attaches constituents
to remote heads in ways that may be linguistically possible but are much less likely
than the more proximate attachments. For instance, in the phrase dedicam-se aos
temas contemporâneos (is dedicated to contemporary themes), PALAVRAS’s choice of
dedicar (dedicam-se) as the governor of contemporâneo is jettisoned in favour of the
much more plausible tema.
Finally, for the purpose of collecting as much data as possible within sketches,
spelling variations are neutralized in the lemma chosen for each word, with modern
Brazilian spelling being used as the standard.
6. Regional variants
There are two main regional variants of Portuguese: Brazilian and European. We
had corresponding subcorpora within the corpus as a whole and the Sketch Engine
provides a keywords function that can list, in order, all words according to how
distinctively Brazilian or European they are. A classification of these words is shown
in Tables 5.2 through 5.12.
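The text does not state how the keywords function scores words. One simple scoring that fits the description is the ratio of normalized frequencies with a smoothing constant added to both sides; treating that as the score used here is an assumption, and the counts in the example are invented purely to show the mechanics.

```python
def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size


def keyness(count_a, size_a, count_b, size_b, smoothing=1.0):
    """Ratio of smoothed per-million frequencies; values above 1 mean the
    word is more typical of subcorpus A, values below 1 of subcorpus B."""
    return ((per_million(count_a, size_a) + smoothing)
            / (per_million(count_b, size_b) + smoothing))


def rank_keywords(counts_a, size_a, counts_b, size_b, smoothing=1.0):
    """List all words from most distinctive of A to most distinctive of B."""
    words = set(counts_a) | set(counts_b)
    scored = [(keyness(counts_a.get(w, 0), size_a,
                       counts_b.get(w, 0), size_b, smoothing), w)
              for w in words]
    return [w for _, w in sorted(scored, reverse=True)]


# Invented toy counts, one million words per subcorpus.
brazilian = {"gol": 300, "golo": 1, "casa": 500}
european = {"gol": 2, "golo": 250, "casa": 500}
print(rank_keywords(brazilian, 1_000_000, european, 1_000_000))
```

The smoothing constant keeps words that are absent from one subcorpus from producing a division by zero and damps the scores of very rare words.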
Table 5.2 shows geographical keywords. The Brazilian list includes adjectives pertaining
to persons born in particular Brazilian states, like carioca (from Rio de Janeiro state),
gaúcho (from Rio Grande do Sul state) and paulista (from São Paulo state), whereas
the Portuguese list includes a reference to Europe (europeu) and Cape Verde (cabo).
Table 5.2 Keywords: Geographical adjectives
Brazilian ptTenTen: brasileiro (Brazilian); carioca (from Rio de Janeiro state); gaúcho (from Rio Grande do Sul state); paulista (from São Paulo state)
European ptTenTen: europeu (European); português (Portuguese); cabo (Cape-(Verde)); euro (also the currency)
Table 5.3 Keywords: Administrative divisions
Brazilian ptTenTen: bairro (neighbourhood); cidade (city), município (city); prefeitura (city council); estadual (state, adj.); federal
European ptTenTen: freguesia (neighbourhood); concelho (city); junta (city council); aldeia (village); autarquia (autonomous state organ)
The keywords also reflect administrative and governmental terms that are specific
to each country (see Table 5.3). The main national administrative levels for both
countries are reflected by words for the neighbourhood (bairro, Brazil; freguesia,
Portugal), the county/city/village (município and cidade, Brazil; concelho and aldeia,
Portugal), the state/province (estadual (adj.), Brazil; distrito, Portugal) and the federation
(federal (adj.), Brazil). Terms for local city governments are prefeitura (city hall,
Brazil) and junta (Portugal).
Table 5.4 Keywords: Administration and politics
Brazilian ptTenTen: prefeito (mayor); delegacia (police station); policial (police officer); deputado (deputy); governador (governor); secretário (secretary); senado (senate); senador (senator); vereador (city council member)
European ptTenTen: polícia (police); secretaria (secretary); autarca (mayor)
The political keywords (see Table 5.4) are full of specific Brazilian terms, based on the
presidential system of government (vereador, governador, deputado, senador, etc.).
The only pair that applies to both variants is the one for mayor: prefeito (Brazil) and
autarca (Portugal).
Table 5.5 Keywords: Grammatical words
Brazilian ptTenTen: diante (in front of, in view of); você (you, 2nd p. sing.); porém (but)
European ptTenTen: perante (in front of, in view of); teu (yours, 2nd p. sing.); vosso (yours, 2nd p. pl.); vós (you, 2nd p. pl.); este (this); isto (this); quer (whether, want [verb]); aquando (when)
The grammatical keywords (see Table 5.5) have interesting dialectal choices. Some
are known differences between the two varieties, such as vós (you) and vosso (yours),
which are common in Portugal but have largely been replaced with você and seus
in Brazil. It is intriguing to note that second-person pronouns are also keywords of
Peninsular versus Latin American Spanish (Kilgarriff and Renau, 2013); in addition,
in informal English, the second-person plural has regionally differentiated forms,
with ‘y’all’ in the southern US, ‘you guys’ in the Northeast and Canada and ‘you lot’
in Britain.
Este (determiner or demonstrative pronoun) and isto (demonstrative pronoun),
keywords of European Portuguese, are traditionally used to indicate referents that are
close to the speaker, as opposed to esse and isso, which refer to referents that are near
the interlocutor. This distinction is still largely observed in Portugal, but is rapidly
disappearing in Brazil, where esse and isso have taken over.
The business keywords (see Table 5.6) are predominantly Brazilian, but the words are
known in both countries. A number of these are distinct by spelling: diretor (male
director), diretora (female director) and diretoria (the director’s office) are spelled with
an intervening ‘c’ in Portugal: director, directora and direcção.
Table 5.6 Keywords: Business terms
Brazilian ptTenTen: diretoria (director’s office); planejamento (planning); diretor (director, male); diretora (director, female); gerente (manager); assessoria (secretary); atendimento (care); capacitação (training); demanda (demand); etapa (phase); pauta (agenda); treinamento (training); vaga (opening); convênio (health insurance; agreement)
European ptTenTen: direcção (director’s office); planeamento (planning)
As previously mentioned, in technology we find many terms that are unique to each
country (see Table 5.7). With the exception of celular (cell phone, Brazil) and telemóvel
(mobile phone, Portugal), they are computing words (acessar, to access; usuário, user,
both in Brazil; aceder, to access; ecrã, screen; ficheiro, file; utente/utilizador, user; and
ligação, link, in Portugal). Brazilian speakers will be more familiar with compartilhar
(to share) than with partilhar, which is preferred in Portugal. They are both used with
the sense of sharing information on the web and it is interesting that each variety has
selected a different word to express that same meaning, when online communication
might suggest otherwise. The online community in both Portugal and Brazil seems
to have a set of vocabulary specific to each country, which is revealed by words like
informático (informatic), which in Brazil would be computacional (computational) or
de computador (*of computer), or many of the words in the technology grouping, such
as usuário (user) in Brazil versus utente and utilizador in Portugal. Other words in this
category predate the web, such as ecrã (screen) and ficheiro (file) in Portugal, which
are tela and arquivo in Brazil, respectively, and are widely known dialectal markers.
Table 5.7 Keywords: Technology
Brazilian ptTenTen: acessar (access); celular (cell phone); usuário (user); busca (search)
European ptTenTen: aceder (access); telemóvel (cell phone); utente (user); ecrã (screen); ficheiro (file); utilizador (user); ligação (link); utilização (use); domínio (domain); informático (computational)
Table 5.8 Keywords: Sports
Brazilian ptTenTen: esporte (sport); esportivo (sports [adj.]); gol (goal); equipe (team), time (team); rodada (round); copa (cup)
European ptTenTen: desporto (sports); desportivo (sports [adj.]); golo (goal); equipa (team)
As with technology, in sports (see Table 5.8) each country has a large set of unique
terms, a number of which are often cited to illustrate the vocabulary differences
between the two dialects. Some of these did show up on the list, like the word for
goal (in soccer or similar sports), which Brazilians call a gol and the Portuguese golo,
or for team, which is time or equipe in Brazil and equipa in Portugal. Some of these
words are borrowings from either English or French, which in turn explains some of
the differences between the two variants in other areas as well, such as computing (see
Table 5.7), where Brazilians tend to either adapt or borrow English terms wholesale
(e.g., mouse, drive, backup, deletar [to delete]), while the Portuguese tend to follow the
French terminology (e.g., ecrã and ficheiro from the French écran and fichier).
Table 5.9 Keywords: Weekdays
Brazilian ptTenTen: segunda-feira (Monday); terça-feira (Tuesday); quarta-feira (Wednesday); quinta-feira (Thursday); sexta-feira (Friday)
European ptTenTen: (none)
Weekdays turned up as Brazilian keywords (see Table 5.9), which is puzzling as
the same words are used in both countries to name the weekdays. In both variants,
weekdays are named in an ordinal manner, in such a way that Monday is called the
second day (Segunda-feira, from the Latin ‘Feria Secunda,’ the second free day in
Easter), Tuesday the third day (Terça-feira, from the Latin ‘Feria Tertia’), Wednesday
the fourth day (Quarta-feira, ‘Feria Quarta’), Thursday the fifth day (Quinta-feira,
‘Feria Quinta’) and Friday the sixth day (Sexta-feira, ‘Feria Sexta’). The feira (from
the Latin feria, meaning ‘free day’) is optional, so that one may say for example terça
to mean terça-feira (for Tuesday). All of these forms, with the exception of terça, are
regular ordinal numbers (in the feminine gender) as well. To find out the source of
variation, for each dialect, we pulled out all weekday words from the subcorpus word
frequency list, excluding segunda (considered to be an outlier, as it alone accounted
for more than 20 percent of the combined frequencies, probably because of its use
as an ordinal numeral), computed their normed counts (per one million words;
pmw) and calculated the mean normed frequencies; we then contrasted the means
statistically and found a statistically significant difference for hyphenated words
(e.g., quarta-feira) but not the unhyphenated ones (e.g., quarta). The mean frequency
for hyphenated weekdays is higher for Brazil (29.1 pmw in Brazil vs. 10.8 pmw in
Portugal; t = 2.516, df = 19, p = .021), thereby accounting for their keyword status in
Brazilian Portuguese. For the unhyphenated forms, the mean is higher for Portugal
(13.0 pmw vs. 11.6 pmw in Brazil), but not significantly so (t = -.323, df = 16, p = .751).
In Halliday’s (1991) terms, this
suggests that the system for weekdays in Portugal is equiprobable, whereas in Brazil, it
is heavily skewed in favour of the full form (71 percent).
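The 71 percent figure can be reproduced from the mean normed frequencies reported above, on the assumption that it is the share of the hyphenated (full) form in the sum of the two means; the same calculation for Portugal gives a share close to one half, consistent with the description of that system as equiprobable.

```python
def full_form_share(hyphenated_pmw, unhyphenated_pmw):
    """Share of the full (hyphenated) form among both forms."""
    return hyphenated_pmw / (hyphenated_pmw + unhyphenated_pmw)


# Mean frequencies per million words, as reported in the text.
brazil = full_form_share(29.1, 11.6)
portugal = full_form_share(10.8, 13.0)

print(f"Brazil: {brazil:.0%}, Portugal: {portugal:.0%}")
```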
Tables 5.10, 5.11 and 5.12 present nouns, adjectives, verbs and adverbs that turned
up as markers of each dialect but did not fit neatly into the previous categories.
Adverb keywords (see Table 5.10) reflect interesting choices. For instance, designadamente
and nomeadamente are used in Portugal to mean roughly ‘namely’ but are
very rare in Brazil. Brazilian Portuguese lacks an immediate equivalent and Brazilian
speakers typically paraphrase this with expressions such as isto é or a saber. The words
principalmente (Brazil) and sobretudo (Portugal; both meaning ‘mainly’) are subtle
markers of each dialect; the fact that they turned up as keywords demonstrates the
strength of both the corpus and the comparative approach.
Table 5.10 Keywords: Adjectives and adverbs
Adjective, Brazilian ptTenTen: ruim (bad); grosso (thick, uncouth)
Adjective, European ptTenTen: elevado (high); habitual (habitual); respectivo (respective); secundário (secondary); vasto (vast)
Adverb, Brazilian ptTenTen: demais (too); principalmente (mainly); somente (only); inclusive (including)
Adverb, European ptTenTen: demasiado (too); sobretudo (mainly, moreover); designadamente (namely); nomeadamente (namely); igualmente (equally); relativamente (relatively)
The noun keywords (see Table 5.11) include many that can be accounted for
by minor spelling differences: the Brazilian controle (control) is spelled controlo
in Portugal. Bilhão, in Brazil, is spelled bilião in Portugal, but the Brazilian word
(borrowed from the American system) means 10^9, which in turn is mil milhões (a
thousand million) in Portugal, whereas the Portuguese bilião (inspired by the French
system) means 10^12 and is a trilhão in Brazil. The Brazilian mídia (media) is spelled
média in Portugal. Other examples include planejamento (planning, Brazil), planeamento
(Portugal); registro (registration/registry, Brazil), registo (Portugal); convênio
(agreement, Brazil), convénio (Portugal); acção (action, Portugal), ação (Brazil);
and facto (fact, Portugal), fato (Brazil). Other keywords are motivated by suffixation:
contributo (contribution, Portugal) is contribuição (Brazil) whereas deslocação
(movement, Portugal) is deslocamento (Brazil). Other nouns are lexical choices, such
as pesquisador in Brazil but an investigador in Portugal (which, in Brazil, would be
more readily associated with a police detective) and fazenda (farm) in Brazil compared
to a quinta in Portugal.
Table 5.11 Keywords: Nouns
Brazilian ptTenTen: registro (registration)*; bilhão (billion); chance (chance); controle (control)*; disputa (dispute)*; fazenda (farm); foco (focus)*; implantação (implementation); integrante (member); mato (bushes)*; mina (mine)*; mídia (media); palestra (conference); programação (program, schedule); renda (income)*; rodovia (highway); show; trecho (stretch; leg of a trip); cotidiano (everyday); morador (dweller); pesquisador (researcher); pesquisa (research/search); item
European ptTenTen: investigador (researcher); edifício (building); zona (zone); investigação (research); registo (registration)*; altura (height); aposta (bet)*; castelo (castle); cento (one hundred; (per) cent); colaboração (collaboration); concerto (concert); contributo (contribution); deslocação (movement); dimensão (dimension); elemento (element); face (face); gama (range)*; intervenção (intervention); nível (level); percurso (journey); pormenor (particular); procura (search)*; recolha (collection)*; restante (remainder); vertente (aspect); acção (action); controlo (control)*; facto (fact); regresso (return)*
*Also a verb form
The verb keywords (see Table 5.12) reveal lesser-known dialectal choices. One
of these has to do with ‘highlight’ words: ressaltar (to highlight) is more typical of
Brazilian Portuguese whereas the near synonyms assinalar, sublinhar, salientar and
realçar (all meaning ‘to highlight’ or ‘to underscore’) are more common in Portugal.
Meter (to put) is common in European Portuguese, but less so in Brazil, where it can
have a rude connotation, generally meaning pushing or forcing something into a
place. Verbal equivalencies also include words meaning (a) to pick up: pegar (Brazil),
apanhar (Portugal); (b) to return: retornar (Brazil), regressar (Portugal); (c) to widen:
ampliar (Brazil), alargar (Portugal); and (d) to register: registrar (Brazil), registar
(Portugal). Keywords motivated by spelling include the Brazilian choices atuar (to act;
which is actuar in Portugal) and planejar (to plan, planear in Portugal).
Table 5.12 Keywords: Verbs
Brazilian ptTenTen: registrar (to register), atender (to take care of),
atuar (to work), buscar (to search), cobrar (to charge; to demand),
conversar (to talk), encaminhar (to send), firmar (to sign), liberar (to free),
ocorrer (to happen), planejar (to plan), repassar (to transfer),
pegar (to pick up), retornar (to return), ampliar (to widen; to increase),
morar (to live), compartilhar (to share), ressaltar (to highlight)
European ptTenTen: registar (to register), atribuir (to assign),
calhar (to come in handy), constituir (to constitute), contatar (to contact),
efetuar (to carry out), equipar (to outfit), gerir (to manage),
permitir (to allow), pretender (to intend), recordar (to remember),
referir (to refer), apanhar (to pick up), regressar (to return), pôr (to put),
meter (to put), partilhar (to share), assinalar (to highlight),
sublinhar (to highlight), salientar (to highlight), realçar (to highlight),
alargar (to widen), situar (to be at; to place), proceder (to proceed),
arranjar (to get), distinguir (to distinguish), decorrer (to follow)

One tool that the lexicographer can use to explore the keywords in Sketch Engine
is Sketch-Diff, which enables the researcher to compare the collocate sets of different
lemmas or to contrast the collocates of a single lemma across different subcorpora.
To illustrate, when we compared the lemma partilhar (to share) across the Brazilian
and European subcorpora, an inspection of the Sketch-Diff showed that the Brazilian
meaning is restricted to the division of property in divorce or in a will, including
collocates such as judicial, amigável (amicable), consensual, tabelião (notary) and
cartório (county clerk); meanwhile, the European meaning appeared far broader,
including informação (information), paixão (passion), visão (view) and ideias (ideas)
in addition to the same legal meaning as in Brazil. With compartilhar, the reverse
was true. Although both varieties shared a handful of collocates, these tended to be
abstract nouns (alegria, happiness; experiência, experience; opinião, opinion; etc.),
with concrete nouns being exclusively Brazilian (toalha, towel; seringa, syringe; talher,
cutlery; etc.).
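The kind of contrast described for compartilhar can be pictured as a simple set comparison. The sketch below is a deliberate simplification and not Sketch Engine's actual Sketch-Diff computation, which works with scored collocates per grammatical relation; the function name sketch_diff is hypothetical, and the two toy sets contain only the collocates quoted above, so the European set is certainly incomplete.

```python
def sketch_diff(collocates_a, collocates_b):
    """Split two collocate collections into shared, only-in-A and only-in-B sets."""
    a, b = set(collocates_a), set(collocates_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Toy data: collocates of compartilhar quoted in the text (assumed, incomplete).
brazilian = {"alegria", "experiência", "opinião", "toalha", "seringa", "talher"}
european = {"alegria", "experiência", "opinião"}

diff = sketch_diff(brazilian, european)
print("shared:", sorted(diff["shared"]))
print("Brazilian only:", sorted(diff["only_a"]))
print("European only:", sorted(diff["only_b"]))
```

With this toy data the abstract nouns fall into the shared set and the concrete nouns into the Brazilian-only set, mirroring the pattern reported for compartilhar.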
7. Conclusion
We have presented our experience in ‘setting up for corpus lexicography’ for Portuguese,
including building a corpus from the web, cleaning it, removing duplicates, parsing
it, loading it into a corpus tool and preparing word sketches from it. We have also
presented an account of the two major varieties of the language as represented by the
two parts of the corpus that we collected.
References
Barlow, M. (2000), MonoConc Pro. Houston, TX: Athelstan.
Baroni, M., Bernardini, S., Ferraresi, A. and Zanchetta, E. (2009), ‘The WaCky Wide Web:
A collection of very large linguistically processed Web-crawled corpora’. Journal of
Language Resources and Evaluation, 43, (3), 209–26.
Bick, E. (2000), The Parsing System PALAVRAS – Automatic Grammatical Analysis of
Portuguese in a Constraint Grammar Framework. PhD Dissertation, Århus University.
Christ, O. and Schulze, M. (1994), The IMS Corpus Workbench: Corpus Query Processor
(CQP). Stuttgart: University of Stuttgart.
Halliday, M. A. K. (1991), ‘Corpus studies and probabilistic grammar’, in K. Aijmer and B.
Altenberg (eds), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London:
Longman, pp. 30–43.
Jakubíček, M., Kilgarriff, A., Kovář, V., Rychlý, P. and Suchomel, V. (2013), ‘The TenTen
Corpus Family’, in Proceedings of the International Corpus Linguistics Conference
(Lancaster).
Kilgarriff, A. and Renau, I. (2013), ‘esTenTen, a vast web corpus of Peninsular and
American Spanish’, in Proceedings of the V International Conference on Corpus
Linguistics (Alicante), pp. 12–19.
Kilgarriff, A., Rychlý, P., Smrz, P. and Tugwell, D. (2004), ‘The Sketch Engine’, in
EURALEX Lorient Proceedings, pp. 105–15.
Rundell, M. and Kilgarriff, A. (2011), ‘Automating the creation of dictionaries: Where
will it all end?’, in F. Meunier, S. De Cock, G. Gilquin and M. Paquot (eds), A Taste
for Corpora: In honour of Sylviane Granger. Amsterdam, Philadelphia, PA: John
Benjamins, pp. 257–81.
Santos, F., Freitas, C., Oliveira, H. and Carvalho, P. (2008), ‘Second HAREM: New
challenges and old wisdom’, in A. J. d. S. Teixeira, V. L. Strube de Lima, L. Caldas
de Oliveira and P. Quaresma (eds), Proceedings of Computational Processing of the
Portuguese Language (PROPOR 2008), pp. 212–5.
Scott, M. (2008), WordSmith Tools. Liverpool: Lexical Analysis Software.
Sharoff, S. (2006), ‘Creating general-purpose corpora using automated search engine
queries’, in M. Baroni and S. Bernardini (eds), Wacky! Working Papers on Web as
Corpus. Bologna: Gedit, pp. 63–98.
Notes
1 http://code.google.com/p/chared
2 http://code.google.com/p/justext
3 http://code.activestate.com/recipes/326576 (language detection using character trigrams)