Chapter

PtTenTen: A Corpus for Portuguese Lexicography

Authors:
  • Lexical Computing Ltd

Abstract

There are many benefits to using corpora. How, then, should someone setting up a dictionary project proceed in order to reap those rewards? We describe a practical experience of such ‘setting up’ for a new Portuguese-English, English-Portuguese dictionary being written at Oxford University Press. We focus on the Portuguese side, as OUP had no Portuguese resources prior to the project. We collected a very large (3.5-billion-word) corpus from the web, removing all unwanted material and duplicates. We then identified the best tools for lemmatizing and parsing Portuguese, and undertook the very large task of parsing the corpus. From the dependency parses output by the parser we created word sketches (one-page summaries of a word’s grammatical and collocational behavior). We plan to customize to Portuguese an existing system for automatically identifying good candidate dictionary examples, and to add salient information about regional words to the word sketches. All of the data and associated support tools for lexicography are available to the lexicographer in the Sketch Engine corpus query system.
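
To make the word-sketch step concrete: a word sketch is, at its core, a table of collocates grouped by grammatical relation, derived from the dependency parses. The Python sketch below illustrates the counting idea on CoNLL-X-style parser output; it is a minimal illustration of the technique, not the Sketch Engine implementation, and the file name and relation labels are assumptions.

```python
from collections import Counter, defaultdict

def read_conll(lines):
    """Yield sentences as lists of token rows from CoNLL-X-style parser
    output (ID FORM LEMMA ... HEAD DEPREL per line, blank line between
    sentences)."""
    sent = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            if sent:
                yield sent
            sent = []
        else:
            sent.append(line.split("\t"))
    if sent:
        yield sent

def collocation_table(sentences):
    """Count dependents per (head lemma, grammatical relation).
    Grouping collocates by relation is the raw material of a word
    sketch; scoring and display are omitted here."""
    table = defaultdict(Counter)
    for sent in sentences:
        for tok in sent:
            lemma, head, rel = tok[2], tok[6], tok[7]
            if head == "0":  # the root has no governing head
                continue
            head_lemma = sent[int(head) - 1][2]
            table[(head_lemma, rel)][lemma] += 1
    return table

# Usage (file name is hypothetical):
# with open("pttenten_sample.conll") as f:
#     table = collocation_table(read_conll(f))
# print(table[("comprar", "obj")].most_common(10))  # objects of 'comprar'
```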


... The Sketch Engine platform was created by Kilgarriff and his collaborators (cf. Kilgarriff et al., 2014) and supports corpus searches at several levels (lemma, phrase, word, among others). Figure (4) shows the platform’s search screen. ...
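
The search levels this snippet mentions correspond to CQL (Corpus Query Language) patterns. Below is a minimal Python sketch of what such queries look like and how they might be sent to the Sketch Engine API; the endpoint, parameter names, corpus name and tag value are all assumptions modelled on the public Bonito-based API, so check the current documentation before relying on them.

```python
import requests  # pip install requests

# CQL patterns for the search levels the snippet mentions. The tag value
# is an assumption: the actual label depends on the Portuguese tagset used.
QUERIES = {
    "word":   '[word="casas"]',                # one exact word form
    "lemma":  '[lemma="casa"]',                # every inflected form of 'casa'
    "phrase": '[lemma="casa"] [tag="ADJ.*"]',  # noun followed by an adjective
}

# Endpoint, parameters and corpus name are assumptions modelled on the
# Bonito-based Sketch Engine API; check the current docs before use.
BASE = "https://api.sketchengine.eu/bonito/run.cgi/view"

def concordance(cql, corpname="preloaded/pttenten", api_key="YOUR_KEY"):
    """Fetch a concordance for one CQL query (parameters hypothetical)."""
    params = {"corpname": corpname, "q": "q" + cql,
              "format": "json", "api_key": api_key}
    return requests.get(BASE, params=params, timeout=30).json()

# Example: hits = concordance(QUERIES["lemma"])
```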
... This corpus was prepared for a new Oxford dictionary (Portuguese/English, English/Portuguese) using an electronic corpus system based on language production from the internet, owing to its wealth of informal, speech-like genres. For Kilgarriff et al. (2014), the electronic system makes the data-collection process faster and more precise. The authors argue that the web is the best way to obtain a corpus of one billion words, which is their goal, and to guarantee the greatest diversity of text types. ...
... 5 2.7 million words in 2013, when our corpus analysis was carried out. 6 So that the corpus would contain no duplicated material and include only texts, the authors took a series of precautions, implemented programmatically, which can be found in Kilgarriff et al. (2014). ...
Thesis
Full-text available
For the last few years there has been a productive debate about the proper semantics of “weak definites” (Beyssade and Oliveira, 2013). Carlson and Sussman (2005) proposed that some definites, which they termed “weak definites”, lack the uniqueness property that is a defining property of regular definite noun phrases. In this investigation, we focus on Carlson et al. (2013), who propose that weak definites refer to events in “incorporated” structures, and mainly on Aguilar-Guevara and Zwarts (2011, 2013), who propose that weak definites in incorporated structures are generic DPs whose noun denotes a kind, which accounts for their lack of individual reference. While those authors focus on formal descriptions, the current research is an empirical investigation designed to evaluate the hypothesis that weak definites are interpreted as generics. We conducted a corpus analysis in Brazilian Portuguese (BP) and four experiments in American English (AE), with the goal of comparing weak definite, regular definite and generic definite interpretations. In the corpus analysis, 2196 definite phrase occurrences were analyzed. We observed the DP’s syntactic function (subject, object, adjunct) and the lexical aspect (activity, state, telic). As a result, we present an interesting finding: according to the categorization criteria employed here, weak DPs occur more often than generic ones. Another interesting finding is that weak definites appear as subjects, and that they appear as adjuncts as often as objects. The experiments were based on the idea that the generic hypothesis predicts that weak definites and generics should pattern together, while the incorporation hypothesis predicts that all three should pattern differently from each other. We constructed 54 sentences with an event or activity verb phrase, 18 of which had an object that could have a weak, generic or regular interpretation. Experiment 1 used a sentence judgement task in which 90 MTurk workers evaluated whether the nouns in the sentences referred to an individual or a category; weak definites patterned with regular definites (individual) and generics with the category reading. Experiment 2 used a forced-choice task, with a replication in BP, and generic definites showed a significant preference for a new-noun continuation, which differs from the weak condition. In Experiment 3, we ran a free completion task, and continuations repeating the target word were least frequent in the weak condition, as expected if weak definites are incorporated. Experiment 4 used a forced completion task, and the bare plural form was used only in the generic condition. We argue that all the results support the view that generic definites pattern differently from weak ones, which constitute a category of definite noun phrases in their own right.
... Broadly speaking, the VariaR project is based on corpora of diverse kinds. Data from a variety of sources are used (data-collection management platforms such as Sketch Engine 11 (KILGARRIFF et al., 2014) and the Corpus do Português 12 (created by Mark Davies), dialogues from databases, newspapers, magazines, social networks, memes, TV programmes, political speech, works of literature, film scripts, animated films, among others). ...
Book
Full-text available
... The ptTenTen corpus used for this study comprises more than 2.7 million words and was developed by Kilgarriff et al. (2014) 13 . The corpus was built by gathering language production from the web, owing to the internet’s richness in informal and speech-like genres. ...
Article
Full-text available
Este artigo tem por objetivo descrever a ocorrência do definido fraco (ex. Ana foi ao hospital), introduzido por Carlson e Sussman (2005), em corpus do português brasileiro (PB). Foram analisadas 400 ocorrências de 31 palavras que podem apresentar a leitura fraca em PB (ex. o hospital). Observamos se a palavra é determinada por um artigo definido, em seguida se leitura do DP é fraca (Carlson e Sussman, 2005), forte - ou regular - (Russell, 1905) ou genérica (Carlson, 2005). Além da leitura, analisamos a função sintática do DP (sujeito, objeto, adjunto). Como resultado, trazemos a distribuição dos definidos fracos em PB, além de realizarmos uma análise mais detalhada sobre os que ocupam a posição de sujeito.
Article
Full-text available
The Sketch Engine is a leading corpus tool, widely used in lexicography. Now, at 10 years old, it is mature software. The Sketch Engine website offers many ready-to-use corpora, and tools for users to build, upload and install their own corpora. The paper describes the core functions (word sketches, concordancing, thesaurus). It outlines the different kinds of users, and the approach taken to working with many different languages. It then reviews the kinds of corpora available in the Sketch Engine, gives a brief tour of some of the innovations from the last few years, and surveys other corpus tools and websites.
Article
Full-text available
Everyone working on general language would like their corpus to be bigger, wider-coverage, cleaner, duplicate-free, and with richer metadata. As a response to that wish, Lexical Computing Ltd. has a programme to develop very large ‘TenTen’ web corpora. In this paper we introduce the Spanish corpus, esTenTen, of 8 billion words and 19 different national varieties of Spanish. We investigate the distance between the national varieties as represented in the corpus, and examine in detail the keywords of Peninsular Spanish vs. American Spanish, finding a wide range of linguistic, cultural and political contrasts.
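
The abstract does not spell out its keyword metric, but the score normally used in Sketch Engine keyword comparisons of this kind is Kilgarriff’s ‘simple maths’ ratio of smoothed normalised frequencies. A minimal sketch, with illustrative variable names:

```python
def keyness(freq_focus, size_focus, freq_ref, size_ref, n=1.0):
    """Kilgarriff's 'simple maths' keyword score: the ratio of smoothed
    frequencies per million tokens. The constant n damps rare words;
    raising it favours higher-frequency keywords."""
    fpm_focus = freq_focus * 1_000_000 / size_focus
    fpm_ref = freq_ref * 1_000_000 / size_ref
    return (fpm_focus + n) / (fpm_ref + n)

# E.g. a word's keyness for Peninsular vs. American Spanish subcorpora:
# score = keyness(f_peninsular, tokens_peninsular, f_american, tokens_american)
```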
Article
Full-text available
Everyone working on general language would like their corpus to be bigger, wider-coverage, cleaner, duplicate-free, and with richer metadata. In this paper we describe our programme to build ever better corpora along these lines for all of the world’s major languages (plus some others). Baroni and Kilgarriff (2006), Sharoff (2006), Baroni et al. (2009), and Kilgarriff et al. (2010) present the case for web corpora and the programmes in which a number of them have been developed. The TenTens are a development of that work: a new family of corpora of the order of 10 billion words. We describe how we are building them, what we have built so far, and how we shall continue maintaining them and keeping them up to date in the years ahead. While, as yet, they have very little metadata, we are working out how to gather and add metadata attribute by attribute. The corpora are all available for research at http://www.sketchengine.co.uk.
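
Of the desiderata listed here (‘cleaner, duplicate-free’), deduplication is the most mechanical step in a web-corpus pipeline. The sketch below illustrates the n-gram shingling idea behind paragraph-level near-duplicate removal (tools such as onion work along these lines); the n-gram length and threshold are illustrative assumptions, not any tool’s defaults.

```python
def shingles(tokens, n=7):
    """All n-gram 'shingles' of a token sequence, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def deduplicate(paragraphs, n=7, threshold=0.5):
    """Greedy near-duplicate removal: drop a paragraph when more than
    `threshold` of its shingles were already seen in earlier paragraphs."""
    seen, kept = set(), []
    for para in paragraphs:
        sh = shingles(para.split(), n)
        if sh and len(sh & seen) / len(sh) > threshold:
            continue  # mostly repeated content: discard
        seen.update(sh)
        kept.append(para)
    return kept
```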
Conference Paper
Full-text available
Discussion of the Second HAREM: changes to the guidelines, introduction of new tracks, improvement of evaluation measures and description of the new evaluation resources.
Article
Full-text available
This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC vs. the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.
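
A lexical-coverage comparison of the kind described reduces to checking what share of a reference wordlist is attested in the corpus vocabulary. A minimal sketch, with both inputs left as assumptions:

```python
def lexical_coverage(corpus_vocab: set[str], reference_words: list[str]) -> float:
    """Fraction of a reference wordlist attested in the corpus vocabulary."""
    attested = sum(1 for w in reference_words if w in corpus_vocab)
    return attested / len(reference_words)

# E.g. vocab from a ukWaC frequency list, reference list from an existing
# resource such as the BNC (both inputs are hypothetical here):
# coverage = lexical_coverage(ukwac_vocab, bnc_top_50k)
```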
Article
Contents (excerpt): 1 Introduction (1.1 Corpora, Annotations, Queries, and Results; 1.2 Organization of this manual); 2 Basic interaction with CQP (2.1 Starting and leaving CQP; 2.2 Selecting a corpus; 2.3 A simple query; 2.4 Displaying a query result: 2.4.1 Setting the AutoShow system variable; 2.4.2 Basic display method; 2.4.3 Changing the browsing method; 2.4.4 Restricting the size of the r…)
Automating the creation of dictionaries: Where will it all end?
  • M Rundell
  • A Kilgarriff
Rundell, M. and Kilgarriff, A. (2011), ‘Automating the creation of dictionaries: Where will it all end?’, in F. Meunier, S. De Cock, G. Gilquin and M. Paquot (eds), A Taste for Corpora: In honour of Sylviane Granger. Amsterdam/Philadelphia, PA: John Benjamins, pp. 257–81.