Marcos André Gonçalves

  • PhD, Computer Science
  • Professor (Associate) at Federal University of Minas Gerais

About

450
Publications
87,475
Reads
9,482
Citations
Current institution
Federal University of Minas Gerais
Current position
  • Professor (Associate)
Additional affiliations
February 2005 - present
Federal University of Minas Gerais
Position
  • Professor (Associate)

Publications

Preprint
Full-text available
This study delves into the mechanisms that spark user curiosity driving active engagement within public Telegram groups. By analyzing approximately 6 million messages from 29,196 users across 409 groups, we identify and quantify the key factors that stimulate users to actively participate (i.e., send messages) in group discussions. These factors in...
Article
Full-text available
We investigate two essential challenges in the context of Hierarchical Topic Modeling (HTM)—(i) the impact of data representation and (ii) topic evaluation. The data representation directly influences the performance of the topic generation, and the impact of new representations such as contextual embeddings in this task has been under-investigated...
Article
Full-text available
Understanding why a trained machine learning model makes some decisions is paramount to trusting the model and applying its recommendations in real-world applications. In this article, we present the design and development of an interactive and visual approach to support the use, interpretation and refinement of ML models, whose development was gui...
Conference Paper
Full-text available
Abstract. This study aims to support artificial intelligence (AI) strategies by merging two databases from two large health studies conducted in Belo Horizonte: the Inquérito de Saúde da Região Metropolitana de Belo Horizonte (RMBH) and Vigitel. This integration aims to support algorithms applied to the study of disease prevalence...
Conference Paper
Full-text available
Abstract. Our study focuses on the hospital costs of patients with COVID-19, a significant share of which is associated with admission to the intensive care unit (ICU). Previous studies have shown that ICU predictive models for patients with COVID-19 had limited effectiveness. Given this scenario, the objective...
Conference Paper
Full-text available
Abstract: One of the main challenges in building effective predictive models is an insufficient amount of data for the learning process. One example of this scenario is the prediction of patients' hospital costs, owing to the sensitivity of these data. In this context, strategies based on Generative Adversarial Networks (GANs)...
Conference Paper
Full-text available
This study introduces the DIS (Detection, Initial Characterization, Semantic Characterization) methodology to analyze and understand temporal shifts in healthcare data. By applying this novel methodology to the Brazilian COVID-19 Registry and Medical Information Mart for Intensive Care (MIMIC-IV) datasets, the research demonstrates its effectivenes...
Conference Paper
Full-text available
This study proposes a hierarchical clustering strategy for census tracts to estimate health outcomes in small areas, using data from the 2010 Demographic Census, conducted by IBGE (Instituto Brasileiro de Geografia e Estatística), and from Vigitel-Vigilância de Fatores de Risco e Proteção para Doenças Crônicas por In...
Conference Paper
Full-text available
Abstract: Selecting evaluation metrics suited to the problem is essential for assessing the performance of predictive models. This study aimed to compare metrics traditionally used to evaluate predictive models with per-class metrics, and to demonstrate how an evaluation that considers a set of complementary metrics can produce...
Conference Paper
Full-text available
During the COVID-19 pandemic, adaptations in the provision of medical services were implemented with the primary objective of reducing contagion risks. Such procedures, combined with other factors, may have generated changes in the costs of private health operators, which could impact their business models due to the lack of predictability. In this...
Conference Paper
Full-text available
Abstract. This work is part of an innovation project that proposes the development of an AI assistant aimed at personalizing health care. The assistant is based on a language model that exploits data from the history of messages exchanged between individuals and the health team, electronic health records, determinants...
Conference Paper
Full-text available
Abstract: The COVID-19 pandemic posed a challenge for health systems worldwide. Most studies conducted lack a detailed analysis of the impact of temporal changes on the strength of association between different predictors and important clinical outcomes. In this context, the present study investigates the impact of charac...
Conference Paper
Full-text available
The growing volume of data in consumer complaint repositories poses significant challenges for the effective management of this information. Notable among these challenges is the fact that many complaints are filed more than once by the same consumer to pressure companies, which can affect the management of these records and...
Conference Paper
Full-text available
Large Language Models (LLMs), based on Artificial Intelligence (AI) techniques, have revolutionized Natural Language Processing (NLP) and are considered the state of the art in several practical NLP tasks, such as text classification, sentiment analysis, text summarization, and question-answering systems. However...
Article
Fine-tuning transformer-based deep learning models is currently at the forefront of natural language processing (NLP) and information retrieval (IR) tasks. However, fine-tuning these transformers for specific tasks, especially when dealing with ever-expanding volumes of data, constant retraining requirements, and budget constraints, can be computat...
Conference Paper
Full-text available
Automatic Text Classification (ATC) in unbalanced datasets is a common challenge in real-world applications. In this scenario, one (or more) class(es) is overrepresented, which usually causes a bias in the learning process towards these majority classes. This work investigates the effect of undersampling methods, which aim to reduce instances of th...
Conference Paper
This Ph.D. dissertation focused on proposing, designing and evaluating a novel textual document representation that exploits the “best of two worlds”: efficient and effective frequentist information (TFIDF representations) with semantic information derived from word embedding representations. In more detail, our proposal – called CluWords – groups...
Conference Paper
This master's thesis proposes the RiskLoss function to deal with the (hard) problem of incorporating risk-sensitiveness measures into Deep Neural Networks (DNNs), by including two adaptations for neural network ranking in ad-hoc retrieval and Recommender Systems (RSs): a differentiable loss function and the use of networks' sub-portions, obtained via...
Conference Paper
This doctoral thesis focuses on proposing, designing, and evaluating a new textual document representation that combines the “best of two worlds”: efficient and effective frequentist information (TFIDF representations) with semantic information derived from word embedding representations. Specifically, our proposal, called Cl...
Article
Full-text available
Background COVID-19 vaccines effectively prevent infection and hospitalization. However, few population-based studies have compared the clinical characteristics and outcomes of patients hospitalized for COVID-19 using advanced statistical methods. Our objective is to address this evidence gap by comparing vaccinated and unvaccinated patients hospit...
Preprint
Full-text available
Ph.D. Dissertation presented to the Graduate Program in Computer Science of the Federal University of Minas Gerais in partial fulfillment of the requirements for the degree of Doctor in Computer Science. UNIVERSIDADE FEDERAL DE MINAS GERAIS Instituto de Ciências Exatas Programa de Pós-Graduação em Ciência da Computação
Preprint
Full-text available
Transformer models have achieved state-of-the-art results, with Large Language Models (LLMs), an evolution of first-generation transformers (1stTR), being considered the cutting edge in several NLP tasks. However, the literature has yet to conclusively demonstrate that LLMs consistently outperform 1stTRs across all NLP tasks. This study compares th...
Preprint
Full-text available
This is the first work to investigate the effectiveness of BERT-based contextual embeddings in active learning (AL) tasks on cold-start scenarios, where traditional fine-tuning is infeasible due to the absence of labeled data. Our primary contribution is the proposal of a more robust fine-tuning pipeline - DoTCAL - that diminishes the reliance on l...
Conference Paper
Full-text available
The ability to represent data in meaningful and tractable ways is crucial for Natural Language Processing (NLP) applications. This Ph.D. dissertation focused on proposing, designing and evaluating a novel textual document representation that exploits the “best of two worlds”: efficient and effective frequentist information (TFIDF representations) w...
Article
Full-text available
The challenge of constructing effective sentiment models is exacerbated by a lack of sufficient information, particularly in short texts. Enhancing short texts with semantic relationships becomes crucial for capturing affective nuances and improving model efficacy, albeit with the potential drawback of introducing noise. This article introduces a n...
Article
Scientific authors’ collaborations are influenced by various factors, such as their field, geographic region, and institutional role. Here we focus on a group of authors whose patterns of publications greatly deviate from the average, previously referred to as hyperprolific authors. Prior studies have investigated the emergence of hyperprolific author...
Article
Full-text available
Contrarian groups, notably Intellectual Dark Web, Alt-lite, and Alt-right, are present across the Web, ranging from fringe websites to mainstream social media. Such massive presence raises major concerns as contrarians often engage in the spread of conspiracy theories and hate speech toward particular groups of people. Historically, there is a gene...
Preprint
Full-text available
BACKGROUND Healthcare data is a valuable resource for improving patient outcomes. If adequately treated and interpreted, it can enhance healthcare services and help to understand the impacts of new technologies and treatments. One important aspect of healthcare data is that it is usually temporal, in the sense that it is collected over time and i...
Article
Full-text available
Background Proper analysis and interpretation of health care data can significantly improve patient outcomes by enhancing services and revealing the impacts of new technologies and treatments. Understanding the substantial impact of temporal shifts in these data is crucial. For example, COVID-19 vaccination initially lowered the mean age of at-risk...
Article
Based on openness and transparency for good governance, unimpeded and verifiable access to legal and regulatory information is essential. With such access, we can monitor government actions to ensure that public financial resources are not improperly or inconsistently used. This facilitates, for example, the detection of unlawful behavior in public...
Article
Transformer architectures have become the main component of various state-of-the-art methods for natural language processing tasks, such as Named Entity Recognition and Relation Extraction (NER+RE). As these architectures rely on semantic (contextual) aspects of word sequences, they may fail to accurately identify and delimit entity spans when ther...
Article
Full-text available
Background Acute kidney injury has been described as a common complication in patients hospitalized with COVID-19, which may lead to the need for kidney replacement therapy (KRT) in its most severe forms. Our group developed and validated the MMCD score in Brazilian COVID-19 patients to predict KRT, which showed excellent performance using data fro...
Conference Paper
Full-text available
In this article, we address the task of Named Entity Recognition (NER) for Organizations and Products/Services mentioned in textual complaints filed on Web platforms. Owing to the strong inference power of large language models (LLMs), there is growing interest in applying them, but they face problems...
Conference Paper
Full-text available
Stacking models are effective in automatic document classification because they exploit the complementarity between models. However, some documents, referred to here as hard documents, are still misclassified because of a bias in which most of the learned models point to a class different from the true one....
Conference Paper
Nowadays, neural network algorithms have excelled in Automatic Text Classification (ATC). However, such enhanced performance comes at high computational costs. Stacking of simpler classifiers that exploit algorithmic and representational complementarity has also been shown to produce superior performance in ATC, enjoying high effectiveness and poten...
Conference Paper
This dissertation explores the use of Federated Learning to Rank (FL2R), a technique employed in search systems that takes into account the privacy of data from multiple clients. FL2R involves building a ranking model that runs in a distributed fashion across several devices. After training, the paramet...
Conference Paper
Extreme multi-label text classification (XMTC) involves assigning relevant labels to text from a huge space of labels. Addressing the core challenges of XMTC (volume, imbalance and quality), we propose xCoRetriev, a two-stage pipeline migrating from a classification perspective to an information retrieval (IR) approach. We address the volume challe...
Conference Paper
Automatic text classification in Natural Language Processing (NLP) is the task of predicting classes for textual documents. Traditionally, there have been two predominant approaches: bag-of-words-based models and more recent sequence-based models. While bag-of-words-based models represent documents considering only the occurrence of individual ter...
Article
The literature has not fully and adequately explained why contextual (e.g., BERT-based) representations are so successful at improving the effectiveness of some Natural Language Processing tasks, especially Automatic Text Classification (ATC). In this article, we evince that such representations, when properly tuned to a target domain, produce an ex...
Conference Paper
Full-text available
Since the global outbreak of the coronavirus disease 2019 pandemic, hundreds of works have been published, analyzing and modeling multiple aspects of the disease. Several of them venture into predictive and modeling tasks, such as mortality prediction and patient severity scoring, using machine-learning (ML) algorithms. An important limitation for most of...
Article
In this article we study and characterize the phenomenon of the hyperprolific authors, who are the most productive researchers according to a given repository in a specific period of time. Particularly, we are interested in investigating and characterizing a subset of such hyperprolific authors who present a sudden growth in the number of published...
Article
Full-text available
The majority of early prediction scores and methods to predict COVID-19 mortality are bound by methodological flaws and technological limitations (e.g., the use of a single prediction model). Our aim is to provide a thorough comparative study that tackles those methodological issues, considering multiple techniques to build mortality prediction mod...
Article
Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, more complexity, best exemplified by Deep Learning Transformers. However, training (or fine-tuning) large dense models for specific applications usually requires significant amounts of computing resources. One way to ameliorate thi...
Article
Full-text available
High recall Information REtrieval (HIRE) aims at identifying only and (almost) all relevant documents for a given query. HIRE is paramount in applications such as systematic literature review, medicine, legal jurisprudence, among others. To address the HIRE goals, active learning methods have proven valuable in determining informative and non-redun...
Article
Full-text available
Forecasting is of utmost importance for the Tourism Industry. The development of models to predict visitation demand to specific places is essential to formulate adequate tourism development plans and policies. Yet, only a handful of models deal with the hard problem of fine-grained (per attraction) tourism demand prediction. In this paper, we argu...
Article
Full-text available
The way Complex Machine Learning (ML) models generate their results is not fully understood, including by very knowledgeable users. If users cannot interpret or trust the predictions generated by the model, they will not use them. Furthermore, the human role is often not properly considered in the development of ML systems. In this article, we pres...
Article
Introduction: Venous thromboembolism (VTE) has a significantly higher incidence in COVID-19 patients when compared to other acute viral infections. The evidence on the risk factors of VTE in COVID-19 in-hospital patients is still inconsistent. The information is of utmost importance, as a path to promote prevention, early diagnosis and treatment. Hy...
Article
We tackle the problem of learning classification models with very small amounts of labeled data (e.g., less than 10% of the dataset) by introducing a novel Single View Co-Training strategy supported by Reinforcement Learning (CoRL). CoRL is a novel semi-supervised learning framework that can be used with a single view (representation). Differently...
Chapter
The rise in mis/disinformation and abusive language online is alarming. These problems threaten society, impacting users’ mental health and even politics and democracy. Social science studies have already theorized about those problems’ mutual spread, for instance, regarding how users interact with mis/disinformation. In this work, we propose to an...
Conference Paper
Full-text available
Transformer-based neural architectures have become the main component of several state-of-the-art methods for natural language processing tasks, such as Named Entity Recognition and Relation Extraction (NER+RE). Because these architectures rely on semantic aspects of word sequences, they may not...
Conference Paper
Full-text available
Record deduplication (RD) aims to identify instances that represent the same real-world entity in data repositories. In the government setting, the RD process facilitates the identification of irregularities and reduces the consumption of computational resources in data integration tasks. In this context, in this article we propose...
Conference Paper
Full-text available
Unrestricted and monitorable access to laws and regulations is an essential prerequisite of democracy. It enables, for example, the detection of illegal acts and the monitoring of fraud in public actions (e.g., procurement bids). However, each federated entity follows its own criteria for standardizing templates and formats when making this information available...
Conference Paper
Full-text available
Nowadays, neural network algorithms, such as those based on Attention and Transformers, have excelled in Automatic Text Classification (ATC). However, such enhanced performance comes at high computational costs. Stacking of simpler classifiers that exploit algorithmic and representational complementarity has also been shown to produce superior per...
Article
Full-text available
Recent efforts have focused on identifying multidisciplinary teams and detecting co-authorship networks based on exploring topic modeling to identify researchers’ expertise. Though promising, none of these efforts perform a real-life evaluation of the quality of the built topics. This paper proposes a Semantic Academic Profiler (SAP) framework that...
Article
Available evidence suggests that inaccurate information during emergencies... has a negative impact on society. Multisectoral actions are needed to counter inaccurate information and health misinformation, including developing legal policies, launching and promoting awareness campaigns, improving health-related content in the media, and increasing the digital and health literacy of people.
Article
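Objective: To compare and summarize the literature on infodemics and health misinformation, and to identify the challenges and opportunities in addressing the infodemic problem. Methods: On 6 May 2022, we searched MEDLINE®, Embase®, the Cochrane Library of Systematic Reviews, Scopus and Epistemonikos, and completed a systematic review by analysing infodemics and health-related misinformation, disinformation and fake news. We grouped studies by similarity and retrieved evidence on challenges and opportunities. We used the AMSTAR-2 approach to assess the methodological quality of the reviews. To assess the quality of the evidence, we used the Grading of Recommendations Assessment, Development and Evaluation guidelines. Results: Our search identified 31 systematic reviews, of which 17 were published. The proportion of health-related misinformation on social media ranged from 0.2% to 28.8%...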
Article
Full-text available
Objective: To compare and summarize the literature on infodemics and health misinformation, and to identify the challenges and opportunities inherent in tackling this issue. Methods: We searched the MEDLINE®, Embase®, Cochrane Library of Systematic Reviews, Scopus and Epistemonikos databases on 6 May 2022...
Article
Objective: To compare and summarize the literature on infodemics and health misinformation, and to identify the challenges and opportunities for addressing infodemic problems. Methods: On 6 May 2022, the authors searched the MEDLINE®, Embase®, Cochrane Library of Systematic Reviews, Scopus and Epistemonikos databases for systematic...
Article
Full-text available
Previous studies that assessed risk factors for venous thromboembolism (VTE) in COVID-19 patients have shown inconsistent results. Our aim was to investigate VTE predictors by both logistic regression (LR) and machine learning (ML) approaches, due to their potential complementarity. This cohort study of a large Brazilian COVID-19 Registry included...
Article
Learning to Rank (L2R) improves ranking quality but relies on the existence of manually labeled training sets, which are expensive and cumbersome to generate. Using automated labeling (e.g., clickthrough data) imposes its own challenges. Active learning (AL) can be used to gather high-quality training data by producing very informative yet small tr...
Chapter
Proceedings of papers presented at the III Congresso Mineiro de Epidemiologia, Prevenção e Controle de Infecções and the 6º Congresso Mineiro de Infectologia. Belo Horizonte, August 12 and 13, 2022.
Preprint
Full-text available
The majority of prognostic scores proposed for early assessment of coronavirus disease 2019 (COVID-19) patients are bound by methodological flaws. Our group recently developed a new risk score - ABC2-SPH - using traditional statistical methods (least absolute shrinkage and selection operator logistic regression - LASSO). In this article, we provide a...
Preprint
Full-text available
Objective: To provide a thorough comparative study among state-of-the-art machine learning methods and statistical methods for determining in-hospital mortality in COVID-19 patients using data upon hospital admission; to study the reliability of the predictions of the most effective methods by correlating the probability of the outcome and the accur...
Article
Full-text available
Background With the rapid adoption of electronic medical records (EMRs), there is an ever-increasing opportunity to collect data and extract knowledge from EMRs to support patient-centered stroke management. Objective This study aims to compare the effectiveness of state-of-the-art automatic text classification methods in classifying data to suppo...
Method
Worldwide, the fast-paced establishment of information has created a dichotomous reality with accurate and unreliable information. Notably, the overproduction of unfiltered data and the speed at which new information is disseminated create a significant social problem that requires appropriate management measures and monitoring guidelines. Besides...
Conference Paper
Pipelines for Text Classification are sequences of tasks needed to be performed to classify documents. The pre-processing phase of these pipelines involves different ways of manipulating documents for the learning phase. This master's thesis introduces three new steps into the traditional pre-processing phase: 1) Meta-Features Generation; 2) Sparsifi...
Conference Paper
The definition of a set of informative features capable of representing and discriminating documents is paramount for the task of automatically classifying documents. In this doctoral dissertation, we present the most comprehensive study so far on the role of meta-features (high-level features built from lower-level ones) as an alternative for repr...
Conference Paper
Full-text available
Pipelines for Text Classification are sequences of tasks needed to be performed to classify documents. The pre-processing phase of these pipelines involves different ways of manipulating documents for the learning phase. This master's thesis introduces three new steps into the traditional pre-processing phase: 1) Meta-Features Generation; 2) Sparsifi...
Article
Full-text available
In this paper, we describe our solution for the task "Profiling Hate Speech Spreaders on Twitter" promoted by PAN CLEF 2021. The task is to identify user profiles that promote hate speech on Twitter. Data for 200 users has been made available; each user has a set of 200 posts and the corresponding label (hate speech propagator or not)....
Article
Full-text available
This article brings two major contributions. First, we present the results of a critical analysis of recent scientific articles about neural and non-neural approaches and representations for automatic text classification (ATC). This analysis is focused on assessing the scientific rigor of such studies. It reveals a profusion of potential issues rel...
Article
Recommender Systems (RSs) make personalized suggestions of relevant items to users. However, the concept of relevance may involve different quality aspects (objectives), such as accuracy, novelty, and diversity. In addition, users may have their own expectations regarding what characterizes a good recommendation. More specifically, individual users...
Article
Full-text available
Background Although the potential of big data analytics for health care is well recognized, evidence is lacking on its effects on public health. Objective The aim of this study was to assess the impact of the use of big data analytics on people’s health based on the health indicators and core priorities in the World Health Organization (WHO) Gener...
Preprint
BACKGROUND With the rapid adoption of electronic medical records (EMRs), there is an ever-increasing opportunity to collect data and extract knowledge from EMRs to support patient-centered stroke management. OBJECTIVE This study aims to compare the effectiveness of state-of-the-art automatic text classification methods in classifying data to suppo...
Preprint
BACKGROUND Although the potential of big data analytics for health care is well recognized, evidence is lacking on its effects on public health. OBJECTIVE The aim of this study was to assess the impact of the use of big data analytics on people’s health based on the health indicators and core priorities in the World Health Organization (WHO) Gener...
Conference Paper
Recent advances in text-related tasks on the Web, such as text (topic) classification and sentiment analysis, have been made possible by exploiting mostly the "rule of more": more data (massive amounts), more computing power, more complex solutions. We propose a shift in the paradigm to do "more with less" by focusing, to the maximum extent, just on the...
Article
Various e-commerce platforms allow sellers to register, describe and organize their own products, using tags and other textual metadata. The quality of these textual descriptors is essential for the effectiveness of e-commerce information services such as search and product recommendation, and thus, for the ability of consumers to find desired prod...
Article
Random forest (RF) classifiers do excel in a variety of automatic classification tasks, such as topic categorization and sentiment analysis. Despite such advantages, RF models have been shown to perform poorly when facing noisy data, commonly found in textual data, for instance. Some RF variants have been proposed to provide better generalization c...
