
Maria das Graças Volpe Nunes
University of São Paulo | USP · Institute of Mathematical and Computer Sciences (ICMC) (São Carlos)
Publications: 164
Reads: 16,413
Citations: 1,570
Publications (164)
With the advance of Natural Language Processing (NLP), corpora have become prominent resources. More than supporting linguistic studies, they form the basis for training Machine Learning models and for developing state-of-the-art computational applications. In particular, there is a great need...
This paper presents the project of a large multi-genre treebank for Brazilian Portuguese, called Porttinari. We address relevant research questions in its construction and annotation, reporting the work already done. The treebank is affiliated with the “Universal Dependencies” international model, widely adopted in the area, and must be the basis f...
The large amount of data available in social media, forums and websites motivates research in several areas of Natural Language Processing, such as sentiment analysis. The popularity of the area due to its subjective and semantic characteristics motivates research on novel methods and approaches for classification. Hence, there is a high demand f...
We report in this paper the coreference annotation process of the CSTNews corpus as part of a collective task of the IberEval 2017 conference. The annotated corpus is composed of 140 news texts written in Brazilian Portuguese and has several annotation layers, including annotations in the morphosyntax/syntax, semantics, and discour...
Text normalization techniques based on rules, lexicons or supervised training requiring large corpora are neither scalable nor domain-interchangeable, which makes them unsuitable for normalizing user-generated content (UGC). Current tools available for Brazilian Portuguese make use of such techniques. In this work we propose a technique based on dis...
Recently, spell checking (or spelling correction systems) has regained attention due to the need to normalize user-generated content (UGC) on the web. UGC presents new challenges to spellers, as its register is much more informal and contains much more variability than traditional spelling correction systems can handle. This paper proposes two ne...
In 1990 the Portuguese-speaking countries signed an agreement on the reform of Portuguese orthography. The implementation of this reform was scheduled for the period between 2008 and 2012, subsequently postponed to 2016. In this work we describe the adaptation process of the Brazilian Portuguese dictionary embedded in the Unitex s...
This paper presents some results on lexicon-based classification of sentiment polarity in web reviews of products written in Brazilian Portuguese. They represent a first step towards a robust opinion miner from reviews of technology products. The evaluation shows the performance of 3 different sentiment lexicons combined with simple strategies. It...
Hunsrückisch is today the most widely spoken variety of German in Brazil. This work aims to build an aligned bilingual Hunsrückisch German–Brazilian Portuguese corpus and, from it, to obtain a bilingual lexicon that can be used to build a statistical machine translation (SMT) system between the two languages. Despite...
The number of citations received by authors in scientific journals has become a major parameter to assess individual researchers and the journals themselves through the impact factor. A fair assessment therefore requires that the criteria for selecting references in a given manuscript should be unbiased with regard to the authors or journals cited....
The realization that statistical physics methods can be applied to analyze written texts represented as complex networks has led to several developments in natural language processing, including automatic summarization and evaluation of machine translation. Most importantly, so far only a few metrics of complex networks have been used and therefore...
Establishing metrics to assess machine translation (MT) systems automatically is now crucial owing to the widespread use of MT over the web. In this study we show that such evaluation can be done by modeling text as complex networks. Specifically, we extend our previous work by employing additional metrics of complex networks, whose...
Topological and dynamic features of complex networks have proven in recent years to be suitable for capturing text characteristics, with various applications in natural language processing. In this article we show that texts with positive and negative opinions can be distinguished from each other when represented as complex networks. The distinctio...
Corpus-based techniques have proved to be very beneficial in the development of efficient and accurate approaches to word sense disambiguation (WSD) despite the fact that they generally represent relatively shallow knowledge. It has always been thought, however, that WSD could also benefit from deeper knowledge sources. We describe a novel approach...
This article briefly presents the Núcleo Interinstitucional de Linguística Computacional (NILC), one of the main Brazilian groups dedicated to research in Natural Language Processing, particularly of Brazilian Portuguese. After a brief history of its formation, we show how the current research areas...
The number of citations received by authors in scientific journals has become a major parameter to assess individual researchers and the journals themselves through the impact factor. A fair assessment therefore requires that the criteria for selecting references in a given manuscript should be unbiased with respect to the authors or the journals c...
Motivated by governmental, commercial and academic interests, and by the growing amount of information, mainly online, the area of automatic text summarization has seen an increasing number of studies and products, which has led to countless summarization methods. In this paper, we present a comprehensive comparative evaluation of th...
Due to idiosyncrasies in their syntax, semantics or frequency, Multiword Expressions (MWEs) have received special attention from the NLP community, as the methods and techniques developed for the treatment of simplex words are not necessarily suitable for them. This is certainly the case for the automatic acquisition of MWEs from corpora. A lot of...
Topological and dynamic features of complex networks have proven to be suitable for capturing text characteristics in recent years, with various applications in natural language processing. In this article we show that texts with positive and negative opinions can be distinguished from each other when represented as complex networks. The distinctio...
Identifying the correct sense of a word in context is crucial for many tasks in natural language processing (machine translation is an example). State-of-the-art methods for Word Sense Disambiguation (WSD) build models using hand-crafted features that usually capture shallow linguistic information. Complex background knowledge, such as semantic r...
Sentence fusion is a task that consists of producing, from a set of related sentences, a single sentence that summarizes the common information presented in the set. This task is of great interest in several Natural Language Processing (NLP) applications, such as Automatic Summarization, Machine Translation,...
Automatic summarization of texts is now crucial for several information retrieval tasks owing to the huge amount of information available in digital media, which has increased the demand for simple, language-independent extractive summarization strategies. In this paper, we employ concepts and metrics of complex networks to select sentences for an...
This paper presents a Portuguese sentence fusion model. Sentence fusion is a text-to-text generation task which takes a set of similar sentences as input and combines these into a single output sentence. This process is of extreme relevance in many NLP applications, for instance, to treat redundancies in Multidocument Summarization by fusing inform...
In this article we present the development and evaluation of an automatic discourse analyzer for Brazilian Portuguese. Following Rhetorical Structure Theory, DiZer is a symbolic system based on the occurrence of textual markers, using discourse templates extracted from a corpus of scientific texts to identi...
Motivated by governmental, commercial and academic interests, the area of automatic text summarization has seen an increasing number of studies and products, which has led to countless summarization methods. In this paper, we present a comprehensive comparative evaluation of the main automatic text summarization methods based on rhetoric...
In this paper we present experiments concerned with automatically learning bilingual resources for machine translation: bilingual dictionaries and transfer rules. The experiments were carried out with Brazilian Portuguese (pt), English (en) and Spanish (es) texts in two parallel corpora: pt–en and pt–es. They were designed to investigate the releva...
Identifying similar text passages plays an important role in many applications in NLP, such as paraphrase generation, automatic summarization, etc. This paper presents some experiments on detecting and clustering similar sentences of texts in Brazilian Portuguese. We propose an evaluation framework based on an incremental and unsupervised clustering...
Complex networks have been increasingly used in text analysis, including in connection with natural language processing tools, as important text features appear to be captured by the topology and dynamics of the networks. Following previous works that apply complex networks concepts to text quality measurement, summary evaluation, and author charac...
Although it has always been thought that Word Sense Disambiguation (WSD) can be useful for Machine Translation, only recently have efforts been made towards integrating both tasks to prove that this assumption is valid, particularly for Statistical Machine Translation (SMT). While different approaches have been proposed and results started to conve...
This paper presents a freely available online lexical alignment tool based on the LIHLA lexical aligner. LIHLA aligns tokens, words and multiword units based on language-independent heuristics (cognates, position, etc.) and automatically built language-dependent resources (bilingual dictionaries). VisualLIHLA allows the online usage, visualizat...
The availability of machine-readable bilingual linguistic resources is crucial not only for machine translation but also for other applications such as cross-lingual information retrieval. However, the building of such resources demands extensive manual work. This paper describes a methodology to build automatically bilingual dictionaries and tr...
This paper presents a modeling technique of texts as complex networks and the investigation of the correlation between the properties of such networks and author characteristics. In an experiment with several books from eight authors, we show that the networks produced for each author tend to have specific features, which indicates that complex net...
In this letter the authors discuss the relationship between structure and random walk dynamics in directed complex networks, with an emphasis on identifying whether a topological hub is also a dynamical hub. They establish the necessary conditions for networks to be topologically and dynamically fully correlated (e.g., word adjacency and airport ne...
We present a novel approach to the word sense disambiguation problem which makes use of corpus-based evidence combined with background knowledge. Employing an inductive logic programming algorithm, the approach generates expressive disambiguation rules which exploit several knowledge sources and can also model relations between them. The ap...
We describe an approach to the automatic creation of a sense-tagged corpus intended to train a word sense disambiguation (WSD) system for English-Portuguese machine translation. The approach uses parallel corpora, translation dictionaries and a set of straightforward heuristics. In an evaluation with nine corpora containing 10 ambiguous verbs,...
This paper presents the challenge of Natural Language Processing, in particular, the case of the Portuguese language in the scope of Computer Science and its disciplines. Questions related to natural language processing are associated with the challenges of knowledge access, information management in data intensive repositories, and the complex and inter...
Translation lexicons are one of the most important linguistic resources for machine translation. However, this bilingual set of word and multiword correspondences requires a lot of manual work to be built. This paper describes a method to automatically build translation lexicons. The lexicons are built by extracting knowledge from PoS-tagged and le...
We describe two systems participating in the English Lexical Sample task in SemEval-2007. The systems make use of Inductive Logic Programming for supervised learning in two different ways: (a) to build Word Sense Disambiguation (WSD) models from a rich set of background knowledge sources; and (b) to build interesting features from the same knowled...
In this article we address the usefulness of linguistic-independent methods in extractive Automatic Summarization, arguing that linguistic knowledge is not only useful, but may be necessary to improve the informativeness of automatic extracts. An assessment of four diverse AS methods on Brazilian Portuguese texts is presented to support our c...
We describe two systems participating in the English Lexical Sample task in SemEval-2007. The systems make use of Inductive Logic Programming for supervised learning in two different ways: (a) to build Word Sense Disambiguation (WSD) models from a rich set of background knowledge sources; and (b) to build interesting features from the same knowledg...
Previous efforts in complex networks research focused mainly on the topological features of such networks, but now also encompass the dynamics. In this Letter we discuss the relationship between structure and dynamics, with an emphasis on identifying whether a topological hub, i.e. a node with high degree or strength, is also a dynamical hub, i.e....
In this paper, we present and analyze the results of the application of a text summarization system – GistSumm – to the task of monolingual question answering at CLEF 2006 for Portuguese texts. We hypothesized that topic-oriented summarization techniques could be able to produce more accurate answers. However, our results showed that there is a big...
Feature engineering is known as one of the most important challenges for knowledge acquisition, since any inductive learning system depends upon an efficient representation model to find good solutions to a given problem. We present an NLP-driven constructive learning method for building features based upon noun phrase structures, which are suppo...
We propose a strategy to support Word Sense Disambiguation (WSD) which is designed specifically for multilingual applications, such as Machine Translation. Co-occurrence information extracted from the translation context, i.e., the set of words which have already been translated, is used to define the order in which disambiguation rules produced...
The identification of the correct sense of a word is necessary for many tasks in automatic natural language processing like machine translation, information retrieval, speech and text processing. Automatic Word Sense Disambiguation (WSD) is difficult and accuracies with state-of-the-art methods are substantially lower than in other areas of t...
In spite of its potential for bidirectionality, Extensible Dependency Grammar (XDG) has so far been used almost exclusively for parsing. This paper represents one of the first steps towards an XDG-based integrated generation architecture by tackling what is arguably the most basic among generation tasks: lexicalization. Herein we present a constra...
The ability to access embedded knowledge makes complex networks extremely promising for natural language processing, which normally requires deep knowledge representation that is not accessible with first-order statistics. In this paper, we demonstrate that features of complex networks, which have been shown to correlate with text quality, can be u...
It is generally agreed that the ultimate goal of research into Word Sense Disambiguation (WSD) is to provide a technology which can benefit applications; however, most of the work in this area has focused on the development of application-independent models. Taking Machine Translation as the application, we argue that this strategy is not appropria...
This paper presents a summary evaluation method based on a complex network measure. We show how to model summaries as complex networks and establish a possible correlation between summary quality and the measure known as dynamics of the network growth. It is a generic and language independent method that enables easy and fast comparative evaluation...
This paper presents the review and evaluation of DiZer – an automatic discourse analyzer for Brazilian Portuguese. Based on Rhetorical Structure Theory, DiZer is a symbolic analyzer that makes use of linguistic patterns learned from a corpus of scientific texts to identify and build the discourse structure of texts. DiZer evaluation shows satisfact...
The availability of machine-readable bilingual linguistic resources is crucial not only for rule-based machine translation but also for other applications such as cross-lingual information retrieval. However, the building of such resources (bilingual single-word and multi-word correspondences, translation rules) demands extensive manual work, and,...
We present a statistical generative model for unsupervised learning of verb argument structures. The model was used to automatically induce the argument structures for the 1,500 most frequent verbs of English. In an evaluation carried out for a representative sample of verbs, more than 90% of the induced argument structures were judged correct...
We present a system that applies Argumentative Zoning (AZ) (Teufel and Moens, 2002), a method of determining argumentative structure in texts, to the task of advising novice graduate writers on their writing. For this task, it is important to automatically determine the rhetorical/argumentative status of a given sentence in the text. On the basis o...
While it is generally agreed that Word Sense Disambiguation (WSD) is an application-dependent task, the great majority of systems pursue application-independent approaches. We propose a strategy to support WSD for Machine Translation which is designed specifically for this application. It relies on the analysis of co-occurrences in the context t...
This paper presents a statistical generative model for unsupervised learning of verb argument structures. The model is based on the noisy-channel model and is trained with the Expectation-Maximization algorithm. The model was used to induce the argument structures for the 1,500 most frequent verbs in English. The evaluation of a sample of this verb...
We investigate the use of ILP for the task of Word Sense Disambiguation (WSD) in two different ways: (a) as a stand-alone constructor of models for WSD; and (b) to build interesting features, which can then be used by standard model-builders such as SVM. Experiments examining a multilingual WSD task in the context of English-Portuguese machine trans...
Since 1993, PROPOR Workshops have become an important forum for researchers involved in the Computational Processing of Portuguese, both written and spoken. This PROPOR Workshop follows previous workshops held in 1993 (Lisbon, Portugal), 1996 (Curitiba, Brazil), 1998 (Porto Alegre, Brazil), 1999 (Évora, Portugal), 2000 (Atibaia, Brazil) and 2003 (...
In this paper we describe LIHLA, a lexical aligner which uses bilingual probabilistic lexicons generated by a freely available set of tools (NATools) and language-independent heuristics to find links between single words and multiword units in sentence-aligned parallel texts. The method has achieved an alignment error rate of 22.72% and 44.49% on E...
This work documents the project and development of various computational linguistic resources that support the Brazilian Portuguese language according to the formal methodology used by the corpus processing system called UNITEX. The delivered resources include computational lexicons, libraries to access compressed lexicons, and additional tools to...
Concepts of complex networks have been used to obtain metrics that were correlated to text quality established by scores assigned by human judges. Texts produced by high-school students in Portuguese were represented as scale-free networks (word adjacency model), from which typical network features such as the in/outdegree, clustering coefficient a...
This paper focuses on how multiparadigm – namely, constraint, object-oriented and higher-order – programming can be drawn upon not only to specify multiparameterized linguistic realization engines but also and above all to rationalize their configuration into full-fledged generation modules for specific language-application pairs. We describe Manat...
This paper describes the automatic generation and the evaluation of sets of rules for word sense disambiguation (WSD) in machine translation. The ultimate aim is to identify high-quality rules that can be used as knowledge sources in a relational WSD model. The evaluation was carried out both automatically, by means of four objective measures (erro...