
Viviane Moreira
Associate Professor, Universidade Federal do Rio Grande do Sul (UFRGS) · Institute of Informatics
About
71 Publications · 8,142 Reads
769 Citations (since 2017)
Publications (71)
In the context of sentiment analysis, there has been growing interest in performing a finer granularity analysis focusing on the specific aspects of the entities being evaluated. This is the goal of Aspect-Based Sentiment Analysis (ABSA) which basically involves two tasks: aspect extraction and polarity detection. The first task is responsible for...
Optical character recognition (OCR) is typically used to extract the textual contents of scanned texts. The output of OCR can be noisy, especially when the quality of the scanned image is poor, which in turn can impact downstream tasks such as information retrieval (IR). Post-processing OCR-ed documents is an alternative to fix digitization errors...
The standard approach for abstractive text summarization is to use an encoder-decoder architecture. The encoder is responsible for capturing the general meaning from the source text, and the decoder is in charge of generating the final text summary. While this approach can compose summaries that resemble human writing, some may contain unrelated or...
This work describes a visual analysis platform to explore randomized clinical trials applied to the treatment or prevention of COVID-19 to map the treatments or adverse reactions observed in such interventions.
A significant portion of the textual information of interest to an organization is stored in PDF files that should be converted into plain text before their contents can be processed by an information retrieval or text mining system. When the PDF documents consist of scanned documents, optical character recognition (OCR) is typically used to extrac...
Over the last decades, oil and gas companies have been facing a continuous increase of data collected in unstructured textual format. New disruptive technologies, such as natural language processing and machine learning, present an unprecedented opportunity to extract a wealth of valuable information within these documents. Word embedding models ar...
This work addresses the problem of identifying and fusing duplicate features in machine learning datasets. Our goal is to evaluate the hypothesis that fusing duplicate features can improve the predictive power of the data while reducing training time. We propose a simple method for duplicate detection and fusion based on a small set of features. An...
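The fusion step described above can be sketched minimally. This is a hypothetical illustration in plain Python (exact-equality detection only), not the method proposed in the paper:

```python
# Hypothetical sketch: fuse columns whose values are identical across all rows.
def fuse_duplicate_features(rows):
    """rows: list of equal-length feature vectors; returns rows with
    duplicate (identical) columns collapsed into a single copy."""
    n_cols = len(rows[0])
    seen, keep = set(), []
    for j in range(n_cols):
        col = tuple(r[j] for r in rows)
        if col not in seen:  # first occurrence of this column's values
            seen.add(col)
            keep.append(j)
    return [[r[j] for j in keep] for r in rows]

# Columns 0 and 1 are duplicates, so only one of them survives.
data = [[1, 1, 2],
        [3, 3, 4],
        [5, 5, 6]]
print(fuse_duplicate_features(data))  # [[1, 2], [3, 4], [5, 6]]
```

Detection here is exact equality; handling near-duplicate features, as the paper's setting suggests, would require a similarity criterion this sketch does not attempt.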
This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity. We propose an approach that relies on translation and multilingual language models in order to compute the contextual similarity between pairs of words. Our hypothesis is that evidence from addit...
BERT (Bidirectional Encoder Representations from Transformers) and ALBERT (A Lite BERT) are methods for pre-training language models which can later be fine-tuned for a variety of Natural Language Understanding tasks. These methods have been applied to a number of such tasks (mostly in English), achieving results that outperform the state-of-the-ar...
A significant amount of the textual content available on the Web is stored in PDF files. These files are typically converted into plain text before they can be processed by information retrieval or text mining systems. Automatic conversion typically introduces various errors, especially if OCR is needed. In this empirical study, we simulate OCR err...
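The kind of OCR-error simulation mentioned above can be illustrated with a toy character-confusion model. The confusion table and error rate below are assumptions for illustration, not the ones used in the study:

```python
import random

# Assumed character-confusion table; real OCR noise models are richer.
CONFUSIONS = {"o": "0", "l": "1", "e": "c", "i": "l", "m": "rn"}

def add_ocr_noise(text, error_rate=0.1, seed=42):
    """Replace confusable characters with an OCR-style error at the given rate."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)

print(add_ocr_noise("information retrieval", error_rate=1.0))
```

Running corpora through such a noise function before indexing is one simple way to study how digitization errors affect retrieval quality.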
Context: Determining which patients are ready for discharge from an Intensive Care Unit (ICU) presents a huge challenge, as ICU readmissions are associated with several negative outcomes, such as increased mortality, length of stay, and cost, compared to patients who are not readmitted during their hospital stay. For these reasons, enhancing ris...
In the last few years, there has been growing interest in aspect-based sentiment analysis, which deals with extracting, clustering, and rating the overall opinion about the features of the entity being evaluated. Techniques for aspect extraction can produce an undesirably large number of aspects — with many of those relating to the same product fea...
The increasing availability of rating datasets (i.e., datasets containing user evaluations on items such as products and services) constitutes a new opportunity in various applications ranging from behavioral analytics to recommendations. In this paper, we describe the design of VugA, a visual enabler for the exploration of rating data and user gro...
The increasing availability of user data constitutes new opportunities in various applications ranging from behavioral analytics to recommendations. A common way of analyzing user data is through "user group analytics", whose purpose is to break down users into groups to gain a more focused understanding of their collective behavior. The process cons...
Offensive posts are a constant nuisance in many Web platforms. As a consequence, there has been growing interest in devising methods to automatically identify such posts. In this paper, we present Hate2Vec -- an approach for detecting offensive comments on the Web. Hate2Vec relies on a classifier ensemble. The base learners include: (i) a lexicon-b...
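The ensemble structure can be sketched as a simple majority vote. The base learners below are toy stand-ins (the lexicon and heuristics are invented for illustration), not the actual Hate2Vec components:

```python
OFFENSIVE_LEXICON = {"idiot", "trash", "moron"}  # invented, for illustration only

def lexicon_learner(text):
    # Flags comments containing a lexicon term (after stripping punctuation).
    return any(w.strip("!?.,:;") in OFFENSIVE_LEXICON for w in text.lower().split())

def shouting_learner(text):
    # Toy heuristic: all-caps comments.
    return text.isupper()

def exclamation_learner(text):
    # Toy heuristic: repeated exclamation marks.
    return text.count("!") >= 2

LEARNERS = [lexicon_learner, shouting_learner, exclamation_learner]

def is_offensive(text):
    votes = sum(f(text) for f in LEARNERS)
    return votes * 2 > len(LEARNERS)  # strict majority vote

print(is_offensive("YOU ARE TRASH!!"))    # True: all three learners fire
print(is_offensive("nice post, thanks"))  # False
```

The appeal of the ensemble design is that a weak lexicon-based learner and learned models can compensate for each other's blind spots.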
The volume of information to which contemporary students have access is huge, and they are surrounded by smartphones, tablets, the internet, and computers in a way that is almost inseparable from their daily lives. Therefore, we believe it is important for schools to consider introducing technological instruments into the classroom to meet the students’...
The Scielo database is an important source of scientific information in Latin America, containing articles from several research domains. A striking characteristic of Scielo is that many of its full-text contents are presented in more than one language, thus being a potential source of parallel corpora. In this article, we present the development o...
We introduce VEXUS, an interactive visualization framework for exploring user data to fulfill tasks such as finding a set of experts, forming discussion groups and analyzing collective behaviors. User data is characterized by a combination of demographics like age and occupation, and actions such as rating a movie, writing a paper, following a medi...
This work proposes duelmerge, a stable merging algorithm that is asymptotically optimal in the number of comparisons and performs O(n log₂(n)) moves. Unlike other partition-based algorithms, we only allow blocks of equal sizes to be swapped, which reduces the number of moves required. We performed experiments comparing duelmerge against a number of...
Brazilian Web users are among the most active in social networks and very keen on interacting with others. Offensive comments, known as hate speech, have been plaguing online media and originating a number of lawsuits against companies which publish Web content. Given the massive number of user generated text published on a daily basis, manually fi...
The importance of emotion mining is acknowledged in a wide range of new applications, thus broadening the potential market already proven for opinion mining. However, the lack of resources for languages other than English is even more critical for emotion mining. In this article, we investigate whether Multilingual Sentiment Analysis delivers relia...
Social networks such as Twitter are used by millions of people who express their opinions on a variety of topics. Consequently, these media are constantly being examined by sentiment analysis systems which aim at classifying the posts as positive or negative. Given the variety of topics discussed and the short length of the posts, the standard appr...
The quality of stemming algorithms is typically measured in two different ways: (i) how accurately they map the variant forms of a word to the same stem; or (ii) how much improvement they bring to Information Retrieval systems. In this article, we evaluate various stemming algorithms, in four languages, in terms of accuracy and in terms of their ai...
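Accuracy measure (i) can be illustrated with a toy conflation test. The stemmer and word groups below are hypothetical, not the algorithms or gold data evaluated in the article:

```python
def conflation_accuracy(stemmer, gold_groups):
    """Fraction of groups whose variant forms all map to a single stem."""
    conflated = sum(1 for group in gold_groups
                    if len({stemmer(w) for w in group}) == 1)
    return conflated / len(gold_groups)

def toy_stemmer(word):
    # Naive suffix stripper, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

groups = [["connect", "connects", "connected"],
          ["run", "running"]]
print(conflation_accuracy(toy_stemmer, groups))  # 0.5: only the first group conflates
```

Measure (ii), by contrast, requires running full retrieval experiments and comparing effectiveness metrics with and without stemming.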
The vast amount of scientific publications available online makes it easier for students and researchers to reuse text from other authors and makes it harder for checking the originality of a given text. Reusing text without crediting the original authors is considered plagiarism. A number of studies have reported the prevalence of plagiarism in ac...
A significant part of the information available on the Web is stored in online databases which compose what is known as Hidden Web or Deep Web. In order to access information from the Hidden Web, one must fill an HTML form that is submitted as a query to the underlying database. In recent years, many works have focused on how to automate the proces...
Authorship analysis aims at classifying texts based on the stylistic choices of their authors. The idea is to discover characteristics of the authors of the texts. This task has a growing importance in forensics, security, and marketing. In this work, we focus on discovering age and gender from blog authors. With this goal in mind, we analyzed a la...
Most scientific articles are available in PDF format. The PDF standard allows the generation of metadata that is included within the document. However, many authors do not define this information, making this feature unreliable or incomplete. This fact has been motivating research which aims to extract metadata automatically. Automatic metadata ext...
Comparable corpora have been used as an alternative for parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we c...
The state-of-the-art in domain-specific Web form discovery relies on supervised methods requiring substantial human effort in providing training examples, which limits their applicability in practice. This paper proposes an effective alternative to reduce the human effort: obtaining high-quality domain-specific training forms. In our approach, the...
Research in external plagiarism detection is mainly concerned with the comparison of the textual contents of a suspicious document against the contents of a collection of original documents. More recently, methods that try to detect plagiarism based on citation patterns have been proposed. These methods are particularly useful for detecting plagiar...
The discovery of HTML query forms is one of the main challenges in Deep Web crawling. Automatic solutions for this problem perform two main tasks. The first is locating HTML forms on the Web, which is done through the use of traditional/focused crawlers. The second is identifying which of these forms are indeed meant for querying, which also typica...
Wikipedia has emerged as an important source of structured information on the Web. But while the success of Wikipedia can be attributed in part to the simplicity of adding and modifying content, this has also created challenges when it comes to using, querying, and integrating the information. Even though authors are encouraged to select appropriat...
In recent years, several methods and tools have been developed, together with test collections, to aid in plagiarism detection. However, both methods and collections have focused on content analysis, overlooking citation analysis. In this paper, we aim at filling this gap and present a test collection with cases of plagiarism by missing and incorrect refe...
Recent research has taken advantage of Wikipedia's multilingualism as a resource for cross-language information retrieval and machine translation, as well as proposed techniques for enriching its cross-language structure. The availability of documents in multiple languages also opens up new opportunities for querying structured Wikipedia content, a...
One of the main tasks in Information Retrieval is to match a user query to the documents that are relevant for it. This matching is challenging because in many cases the keywords the user chooses will be different from the words the authors of the relevant documents have used. Throughout the years, many approaches have been proposed to deal with th...
Several advanced data management applications, such as data integration, data deduplication, and similarity querying rely on the application of similarity functions. A similarity function requires the definition of a threshold value in order to decide whether two different data instances match, i.e., if they represent the same real world object. In...
The extensive use of Multiword Expressions (MWE) in natural language texts prompts more detailed studies that aim for a more adequate treatment of these expressions. A MWE typically expresses concepts and ideas that usually cannot be expressed by a single word. Intuitively, with the appropriate treatment of MWEs, the results of an Information Retri...
This paper presents a new method for Cross-Language Plagiarism Analysis. Our task is to detect the plagiarized passages in the suspicious documents and their corresponding fragments in the source documents. We propose a plagiarism detection method composed by five main phases: language normalization, retrieval of candidate documents, classifier tra...
The quality of stemming algorithms is typically measured in two different ways: (i) how accurately they map the variant forms of a word to the same stem; or (ii) how much improvement they bring to Information Retrieval. In this paper, we evaluate different Portuguese stemming algorithms in terms of accuracy and in terms of their aid to Information...
Ranking groups of researchers is important in several contexts and can serve many purposes such as the fair distribution of grants based on the scientist's publication output, concession of research projects, classification of journal editorial boards and many other applications in a social context. In this paper, we propose a method for measuring...
The goal of approximate data matching is to assess whether two distinct data instances represent the same real world object. This is usually achieved through the use of a similarity function, which returns a score that defines how similar two data instances are. If this score surpasses a given threshold, both data instances are considered as repres...
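The threshold-based matching decision can be sketched as follows, using Python's standard-library SequenceMatcher as a stand-in similarity function (the paper does not prescribe this particular function, and the threshold is an assumption):

```python
from difflib import SequenceMatcher

def matches(a, b, threshold=0.8):
    """Decide whether two strings refer to the same real-world object."""
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return score >= threshold

print(matches("Viviane Moreira", "Viviane Moreyra"))  # True: one-letter variation
print(matches("Viviane Moreira", "John Smith"))       # False
```

As the abstract notes, the choice of threshold is critical: too low and distinct objects are merged, too high and true matches are missed.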
We introduce a problem called maximum common characters in blocks (MCCB), which arises in applications of approximate string comparison, particularly in the unification of possibly erroneous textual data coming from different sources. We show that this problem is NP-complete, but can nevertheless be solved satisfactorily using integer linear progra...
For UFRGS’s participation in CLEF’s Robust task, our aim was to compare retrieval of plain documents to retrieval using information on word senses. The experimental runs which used word-sense disambiguation (WSD) consisted in indexing the synset codes of the senses which had scores higher than a predefined threshold. Several thresholds were tested...
This paper proposes the use of algorithms for mining association rules as an approach for Cross-Language Information Retrieval. These algorithms have been widely used to analyse market basket data. The idea is to map the problem of finding associations between sales items to the problem of finding term translations over a parallel corpus. The propo...
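The mapping from market baskets to term translations can be illustrated with a toy aligned corpus. The sentence pairs and the plain-confidence rule below are simplifications assumed for illustration, not the paper's full mining procedure:

```python
from collections import Counter

# Toy "baskets": each aligned pair holds the terms of a source and a target sentence.
pairs = [({"casa", "branca"}, {"white", "house"}),
         ({"casa", "azul"},   {"blue", "house"}),
         ({"carro", "azul"},  {"blue", "car"})]

src_count, joint = Counter(), Counter()
for src, tgt in pairs:
    for s in src:
        src_count[s] += 1
        for t in tgt:
            joint[(s, t)] += 1

def translations(term, min_conf=0.9):
    """Target terms whose rule term -> target has confidence >= min_conf."""
    return sorted(t for (s, t), c in joint.items()
                  if s == term and c / src_count[s] >= min_conf)

print(translations("casa"))  # only 'house' co-occurs in every pair containing 'casa'
```

High-confidence rules such as casa → house then serve as candidate query-term translations for cross-language retrieval.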
For UFRGS’s participation in the TEL task at CLEF 2008, our aim was to assess the validity of using algorithms for mining association rules to find mappings between concepts in a Cross-Language Information Retrieval scenario. Our approach requires a sample of parallel documents to serve as the basis for the generation of the association rules. The r...
Several advanced data management applications, such as data integration, data deduplication or similarity querying rely on the application of similarity functions. A similarity function requires the definition of a threshold value in order to assess if two different data instances match, i.e., if they represent the same real world object. In this c...
This paper presents a method for assessing the quality of similarity functions. The scenario taken into account is that of approximate data matching, in which it is necessary to determine whether two data instances represent the same real world object. Our method is based on the semi-automatic estimation of optimal threshold values. We propose two...
Approximate data matching applications typically use similarity functions to quantify the degree of likeness between two data instances. There are several similarity functions available, thus, it is often necessary to evaluate a number of them aiming at choosing the function that is more adequate to a specific application. This paper presents...
For UFRGS's first participation on CLEF our goal was to compare the performance of heavier and lighter stemming strategies using the Portuguese data collections for Monolingual Ad-hoc retrieval. The results show that the safest strategy was to use the lighter alternative (reducing plural forms only). On a query-by-query analysis, full stemming ach...
This paper presents a study of relevance feedback in a cross-language information retrieval environment. We have performed an experiment in which Portuguese speakers are asked to judge the relevance of English documents; documents hand-translated to Portuguese and documents automatically translated to Portuguese. The goals of the experiment were to...
Simulated networks of spiking leaky integrators are used for categorisation and for Information Retrieval (IR). Neurons in the network are sparsely connected, learn using Hebbian learning rules, and are simulated in discrete time steps. Our earlier work has used these models to simulate human concept formation and usage, but we were interested in the mo...
This paper reports the work of Middlesex University in the CLEF bilingual task. We have carried out experiments using Portuguese queries to retrieve documents in English. The approach used was Latent Semantic Indexing, which is an automatic method not requiring dictionaries or thesauri. We have also run a monolingual version of the system to work a...
The conceptual schema (intention) and raw data (extension) are evolving entities which require adequate support for past, present and even future versions. Temporal Databases supporting schema evolution were developed with the aim of satisfying this need. The support for schema versioning raises two complex subjects: the storage of the several sche...
This paper reports the work of Middlesex University for the CLEF bilingual task. We have carried out experiments using Portuguese queries to retrieve documents in English. The approach used was Latent Semantic Indexing, which is an automatic method not requiring dictionaries or thesauri. We describe the methods used along with an analysis of the re...
Raw data and database structures are evolving entities that require adequate support for past, present and even future versions. Temporal databases supporting schema versioning were developed with the aim of satisfying this requirement. This paper considers a generalized temporal database system, which provides support for time at both intensional...
For UFRGS's participation in CLEF's Robust task, our aim was to assess the benefits of identifying and indexing Multiword Expressions (MWEs) for Information Retrieval. The approach used for MWE identification was totally statistical, based on association measures such as Mutual Information and Chi-square. Contradicting our results on the training topi...
The Wikipedia is an online encyclopedia available in about 200 languages. Its Portuguese version currently contains over 200 thousand articles. If we consider each Wikipedia article as a vertex and each link as an arc, we have what we call a "Wikigraph". This graph differs from other Web graphs mainly because it has temporal information associate...
This paper applies the Cell Assemblies (CAs) model to Information Retrieval. CAs are reverberating circuits of neurons that can persist long beyond the initial stimulus has ceased. CAs are learned through Hebbian learning rules and have been used to simulate the formation and the usage of human concepts. We adapted the CAs model to learn relationsh...
Master's dissertation, Instituto de Informática, Universidade Federal do Rio Grande do Sul.