
Andre de Carvalho, PhD
University of São Paulo (Universidade de São Paulo)
About
Publications: 429
Reads: 172,184
Citations: 9,467
Additional affiliations
January 2000 - December 2000
January 2009 - present
Education
September 1990 - July 1994
Publications (429)
The adoption of deep learning algorithms in the medical imaging area is a prominent research issue, with high potential for advancing AI-based Computer-aided diagnosis solutions. However, current solutions face challenges due to a lack of interpretability features and high data demands, prompting recent efforts to address these issues. In this stud...
Legislative houses all over the world are adopting tools based on artificial intelligence to support their work. The incorporation of these tools can improve the analysis of the text of the proposed new laws and speed the preparation and discussion of new laws. The performance of artificial intelligence tools for text processing tasks is largely af...
As the storage of biological sequences grows, extracting information from them becomes crucial for advances in health. The complexity of these sequences demands sophisticated techniques, such as Machine Learning (ML). However, developing robust ML solutions requires specialized knowledge, often beyond the reach of many researchers...
Machine Learning (ML) algorithms have been important tools for the extraction of useful knowledge from biological sequences, particularly in healthcare, agriculture, and the environment. However, the categorical and unstructured nature of these sequences usually requires additional feature engineering steps before an ML algorithm can be efficient...
The growing influx of lawsuits in judicial systems presents a pressing challenge for timely case resolution. The Sao Paulo Justice Court is particularly noteworthy, boasting the world’s largest caseload with an 84% congestion rate and an average processing time of over seven years. To address this issue, this article introduces LegalClass, a comput...
The increasing volume and complexity of legal documents have led to a growing interest in text summarization for legal texts. In this context, this paper presents LegalSum, a tool for automatically summarizing lawsuits in Portuguese, aiming to improve the efficiency of legal professionals and researchers. The tool is equipped with a legal-domain expr...
Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Due to the high number of possibilities for these hyperparameter configurations and their complex interactions, it is common to use optimization techniques to find settings that lead to high predicti...
Machine learning algorithms often contain many hyperparameters whose values affect the predictive performance of the induced models in intricate ways. Due to the high number of possibilities for these hyperparameter configurations, and their complex interactions, it is common to use optimization techniques to find settings that lead to high predict...
Machine Learning has revolutionized the categorization of vast legal documents, minimizing costs and improving evaluations. However, conventional models struggle with unseen data categories in real-world scenarios, a challenge termed Open Set Classification. Our study tackles the issue faced by the Court of Justice in São Paulo, Brazil, to identify...
Given the increasing number of biological sequences stored in databases, there is a large source of information that can benefit several sectors such as agriculture and health. Machine Learning (ML) algorithms can extract useful and new information from these data, increasing social and economic benefits, in addition to productivity. However, the c...
Applying Machine Learning (ML) algorithms to a dataset can be time-consuming. It usually involves, not only selecting and fine-tuning the algorithm, but also other steps, such as data preprocessing. To reduce this time, the whole or a subset of this process has been automated by Automated ML (AutoML) techniques, which can include Bayesian Optimizat...
Data stream applications in highly dynamic environments often face concept drift problems, a phenomenon in which the statistical properties of the variables change over time, which can degrade the performance of Machine Learning models. This work presents a new model monitoring tool through the use of Meta Learning. The algorithm was conceived for...
Several AutoML tools aim to facilitate the usability of machine learning algorithms, automatically recommending algorithms using techniques such as meta-learning, grid search, and genetic programming. However, the preprocessing step is usually not well handled by those tools. Thus, in this work, we present a systematic review of preprocessing algor...
Due to their unique optical and electronic functionalities, chalcogenide glasses are materials of choice for numerous microelectronic and photonic devices. However, to extend the range of compositions and applications, profound knowledge about composition-property relationships is necessary. To this end, we collected a large quantity of composition...
Meta-learning is increasingly used to support the recommendation of machine learning algorithms and their configurations. These recommendations are made based on meta-data, consisting of performance evaluations of algorithms and characterizations on prior datasets. These characterizations, also called meta-features, describe properties of the data...
Initializing the hyper-parameters (HPs) of machine learning (ML) techniques has become an important step in the area of automated ML (AutoML). The main premise in HP initialization is that an HP setting that performs well for a certain dataset(s) will also be suitable for a similar dataset. Thus, evaluation of similarities of datasets based on their cha...
Choosing the most suitable algorithm to perform a machine learning task for a new problem is a recurrent and complex task. In multi-target regression tasks, when problem transformation methods are applied, this choice is even harder. The reason is the need to simultaneously choose the problem transformation method and the base learning algorithm. T...
Metalearning has been widely used in recent years to recommend machine learning algorithms for new problems based on past experience. To this end, the first step is the creation of a metabase, or metadataset, containing metafeatures extracted from several datasets along with the performance of a pool of candidate algorithm(s). The next step is the...
With the advent of powerful computer simulation techniques, it is time to move from the widely used knowledge-guided empirical methods to approaches driven by data science, mainly machine learning algorithms. We investigated the predictive performance of three machine learning algorithms for six different glass properties. For such, we used an exte...
For some time now, the field of artificial intelligence has ceased to be seen as merely theoretical, intended for application to small “curious” problems, and has become a growing research field seeking solutions to real problems of society.
Winner of the 2012 Prêmio Jabuti (Technology and Informatics category) when it was released, Inteligênci...
Data stream mining needs to deal with scenarios where data distribution can change over time. As a result, different learning algorithms can be more suitable in different time periods. This paper proposes micro-MetaStream, a meta-learning based method to recommend the most suitable learning algorithm for each new example arriving in a data stream....
A central aspect of online decision trees is evaluating the incoming data and performing model growth. To this end, trees must deal with different kinds of input features. Numerical features are no exception, and they pose additional challenges compared to other kinds of features, as there is no trivial strategy to choose the best point to make a spli...
Human Activity Recognition is focused on the use of sensing technology to classify human activities and to infer human behavior. While traditional machine learning approaches use hand-crafted features to train their models, recent advancements in neural networks allow for automatic feature extraction. Auto-encoders are a type of neural network that...
Imbalanced datasets are an important challenge in supervised Machine Learning (ML). According to the literature, class imbalance does not necessarily impose difficulties for ML algorithms. Difficulties mainly arise from other characteristics, such as overlapping between classes and complex decision boundaries. For binary classification tasks, calcu...
A central aspect of online decision tree solutions is evaluating the incoming data and enabling model growth. To this end, trees must deal with different kinds of input features and partition them to learn from the data. Numerical features are no exception, and they pose additional challenges compared to other kinds of features, as there is no trivial...
Meta-learning has been successfully applied to time series forecasting. To do so, it uses meta-datasets created by previous machine learning applications. Each row in a meta-dataset represents a time series dataset. Each column, apart from the last, is a meta-feature describing aspects of the related dataset. The last column is a target value, a meta-la...
Incremental machine learning algorithms have been effective alternatives to deal with stream data. The Hoeffding Tree framework is one of the most successful solutions for supervised online prediction tasks. Although online regression tasks are present in several forms, and in many real-life problems, most of the research efforts have been devoted...
Classification tasks using imbalanced datasets are not challenging on their own. Classification models perform poorly on the minority class when the datasets present other difficulties, such as class overlap and complex decision border. Data complexity measures can identify such difficulties, better dealing with imbalanced datasets. They can captur...
Meta-Learning has been widely used in recent years to support the recommendation of the most suitable machine learning algorithm(s) and hyperparameters for new datasets. Traditionally, a meta-base is created containing meta-features extracted from several datasets along with the performance of a pool of machine learning algorithms when applied...
This paper presents an experimental comparison among four Automated Machine Learning (AutoML) methods for recommending the best classification algorithm for a given input dataset. Three of these methods are based on Evolutionary Algorithms (EAs), and the other is Auto-WEKA, a well-known AutoML method based on the Combined Algorithm Selection and Hy...
With the advent of powerful computer simulation techniques, it is time to move from the widely used knowledge-guided empirical methods to approaches driven by data science, mainly machine learning algorithms. Due to their (hidden) smooth composition-property relationships, this strategy is especially relevant for the development of new glasses. We...
Investigating strategies that are able to efficiently deal with multi-label classification tasks is a current research topic in machine learning. Many methods have been proposed, making the selection of the most suitable strategy a challenging issue. From this premise, this paper presents an extensive empirical analysis of the binary transformation...
Machine Learning (ML) algorithms have been successfully employed by a vast range of practitioners with different backgrounds. One of the reasons for ML's popularity is its capability to consistently deliver accurate results, which can be further boosted by adjusting hyperparameters (HP). However, some practitioners have limited knowledge about the...
Several studies in the field of human–computer interaction have focused on the importance of emotional factors related to the interaction of humans with computer systems. Based on knowledge of the users’ emotions, intelligent software can be developed for interacting with and even influencing users. However, such a scenario is still a challenge...
Modern technologies demand the development of new glasses with unusual properties. Most of the previous developments occurred by slow, expensive trial-and-error approaches, which have produced a considerable amount of data over the past 100 years. By finding patterns in such types of data, Machine Learning (ML) algorithms can extract useful knowled...
Human Activity Recognition is a machine learning task for the classification of human physical activities. Applications for that task have been extensively researched in recent literature, especially due to the benefits of improving quality of life. Since wearable technologies and smartphones have become more ubiquitous, a large amount of informatio...
Automated recommendation of machine learning algorithms is receiving a great deal of attention, not only because it can recommend the most suitable algorithms for a new task, but also because it can support efficient hyper-parameter tuning, leading to better machine learning solutions. The automated recommendation can be implemented using meta-...
Image segmentation is a key issue in image processing. New image segmentation algorithms have been proposed in recent years. However, there is no optimal algorithm for every image processing task. The selection of the most suitable algorithm usually occurs by testing every possible algorithm or using knowledge from previous problems. These proces...
In data streams new classes can appear over time due to changes in the data statistical distribution. Consequently, models can become outdated, which requires the use of incremental learning algorithms capable of detecting and learning the changes over time. However, when a single classification model is used for novelty detection, there is a risk...
Data streams are related to large amounts of data that can continuously arrive with a probability distribution that may change over time. Depending on the changes in the data distribution, different phenomena can occur: for example, new classes can appear or concept drift can occur in existing classes. Machine Learning algorithms have often been used to mo...
Human mobility has a significant impact on several layers of society, from infrastructural planning and economics to the spread of diseases and crime. Representing the system as a complex network, in which nodes are assigned to regions (e.g., a city) and links indicate the flow of people between two of them, physics-inspired models have been propos...
For many machine learning algorithms, predictive performance is critically affected by the hyperparameter values used to train them. However, tuning these hyperparameters can come at a high computational cost, especially on larger datasets, while the tuned settings do not always significantly outperform the default values. This paper proposes a rec...
The amount of available data is growing rapidly. Developing machine learning strategies to cope with high-throughput and changing data streams is highly relevant. Among the prediction tasks in online machine learning, multi-target regression has gained increased attention due to its high applicability and relation with real-world p...
Humans are frequently looking for patterns and uniformity to support their choices and decisions. Whatever falls outside the expected can be said to be an anomaly. However, in many practical situations, the presence of anomalies can provide valuable insights, which can point out useful novelties. Thus, in predictive maintenance, for example, anomal...
Imbalanced datasets may negatively impact the predictive performance of most classical classification algorithms. This problem, commonly found in real-world applications, is known in the machine learning domain as imbalanced learning. Most techniques for dealing with imbalanced learning have been proposed and applied only to binary classification. When applied...
Hierarchical Multi-Label Classification is a challenging classification task where the classes are hierarchically structured, with superclass and subclass relationships. It is a very common task, for instance, in Protein Function Prediction, where a protein can simultaneously perform multiple functions. In these tasks it is very difficult to achiev...
Human Activity Recognition has been primarily investigated as a machine learning classification task, forcing it to deal with two main limitations. First, it must assume that the testing data has the same distribution as the training sample. However, the inherent structure of an activity recognition system is prone to distribution changes ove...
Data stream is a challenging research topic in which data can continuously arrive with a probability distribution that may change over time. Depending on the changes in the data distribution, different phenomena can occur, for example, a concept drift. A concept drift occurs when the concepts associated with a dataset change when new data arrive. T...
Human activity recognition (HAR) is a classification task that aims to classify human activities or predict human behavior by means of features extracted from sensor data. Typical HAR systems use wearable sensors and/or handheld and mobile devices with built-in sensing capabilities. Due to the widespread use of smartphones and to the inclusion of...
Recently, several classification algorithms capable of dealing with potentially infinite data streams have been proposed. One of the main challenges of this task is to continuously update predictive models to address concept drifts without compromising their predictive performance. Moreover, the classification algorithm used must be able to efficient...
Meta-learning has been successfully used for algorithm recommendation tasks. It uses machine learning to induce meta-models able to predict the best algorithms for a new dataset. In this paper, meta-models are applied to a set of meta-features, describing a dataset, to predict the performance of clustering algorithms applied to this dataset. The pa...
As Collaborative Filtering becomes increasingly important in both academia and industry recommendation solutions, it also becomes imperative to study the algorithm selection task in this domain. This problem aims at finding automatic solutions which enable the selection of the best algorithms for a new problem, without performing full-fledged train...
Algorithm selection using Metalearning aims to find mappings between problem characteristics (i.e. metafeatures) and relative algorithm performance to predict the best algorithm(s) for new datasets. Therefore, it is of the utmost importance that the metafeatures used are informative. In Collaborative Filtering, recent research has created an exten...
Dealing with memory and time constraints is a current challenge when learning from data streams with a massive amount of data. Many algorithms have been proposed to handle these difficulties, among them the Very Fast Decision Tree (VFDT) algorithm. Although the VFDT has been widely used in data stream mining, in recent years several authors hav...
Noise is often present in real datasets used for training Machine Learning classifiers. Its disruptive effects on the learning process may include increased complexity of the induced models, higher processing time, and reduced predictive power in the classification of new examples. Therefore, treating noisy data in a preprocessing step i...
Meta-learning is increasingly used to support the recommendation of machine learning algorithms and their configurations. Such recommendations are made based on meta-data, consisting of performance evaluations of algorithms on prior datasets, as well as characterizations of these datasets. These characterizations, also called meta-features, describ...
Motivation: With the recent advances in DNA sequencing technologies, the study of the genetic composition of living organisms has become more accessible for researchers. Several advances have been achieved because of it, especially in the health sciences. However, many challenges which emerge from the complexity of sequencing projects remain unsol...
The glass transition temperature (Tg) is a kinetic property of major importance for both fundamental and applied glass science. In this study, we designed and trained an artificial neural network to induce a model that can predict the Tg of multicomponent oxide glasses. To do this, we used a dataset containing more than 55,000 inorganic glass compo...
To select the best algorithm for a new problem is an expensive and difficult task. However, there are automatic solutions to address this problem: using Metalearning, which takes advantage of problem characteristics (i.e. metafeatures), one is able to predict the relative performance of algorithms. In the Collaborative Filtering scope, recent works...
Food trucks are a widely popular fast food restaurant alternative, whose differentiating factor is their proximity to customers. Their popularity has stimulated the expansion of available options, which now includes several different types of cuisines, consequently making the choice by customers a challenging issue. From data obtained via a market...
This paper addresses the Cluster Editing problem. The objective of this problem is to transform a graph into a disjoint union of cliques using a minimum number of edge modifications. This problem has been considered in the context of bioinformatics, document clustering, image segmentation, consensus clustering, qualitative data clustering among oth...
Many real-world situations constantly generate concept-drifting data streams at high speed. These situations demand adaptive algorithms able to learn online in accordance with the most recent target function (concept). This paper presents Online Adaptive Classifier Ensemble, a new ensemble algorithm able to learn from concept-drifting data streams....
This chapter describes a new group of predictive learning algorithms – search‐based and optimization‐based algorithms – which allow us to deal efficiently with more complex classification tasks. Decision tree induction algorithms (DTIAs) induce models with a tree‐shaped decision structure where each internal node is associated with one or more pred...
This chapter describes the three current fields of data analytics that are attracting a great deal of attention due to their wide application in different domains: text mining, social network analysis (SNA) and recommendation systems. Text mining is a very active area of data analytics. Text mining is an important part of several other tasks, like...
Classification is one of the most common tasks in analytics, and the most common in predictive analytics. In addition to being the most common classification task, binary classification is the simplest classification task. This chapter illustrates how the data are distributed in the data set. The main concern of most data analytic applications is t...
Predictive tasks are divided between classification tasks and regression tasks. This chapter focusses mainly on regression. It describes the concepts that are meaningful for both regression and classification, namely generalization and model validation. The chapter also describes some of the most popular regression methods. The methods described ar...
This chapter explores the advanced subjects in predictive analytics. The individual classifiers whose predictions will be combined will be referred to here as the “base” classifiers. Each base classifier can be induced using the same, original, training set, or parts of the original training set. Two important requirements in developing ensembles w...
This chapter focuses on how a data set can be described by descriptive statistics and by visualization techniques for single attributes and pairs of attributes. It presents several univariate and bivariate statistical formulae and data visualization techniques. The chapter describes the different scale types that exist to describe data. There are t...
This chapter presents a cheat sheet of descriptive analytics. The main purpose of descriptive analytics is to understand the data, providing relevant knowledge for future decisions in the project development. It presents the main aspects of the univariate methods, that is, methods used to summarize a single attribute. It also presents a summary of b...
This chapter presents an important family of techniques for descriptive tasks. They can describe a data set by partitioning it, so that objects in the same group are similar to each other. These “clustering” techniques have been developed and extensively used to partition data sets into groups. Clustering techniques use only predictive attributes t...
This chapter discusses frequent itemset mining, describing the three main approaches: Apriori, Eclat and frequent pattern growth (FP‐Growth). Frequent pattern mining methods were developed to deal with very large data sets recorded in hypermarkets and social media sites. The chapter discusses the min_sup threshold, a hyper‐parameter with high i...
This chapter discusses the aspects of data quality and describes the preprocessing techniques frequently used in data analytics. The quality of a dataset strongly affects the results of a data quality project. The chapter also discusses the techniques for data‐type conversions, a necessary operation when the values of a predictive attribute need to...
This introduction presents an overview of key concepts discussed in the subsequent chapters of the book. The book describes two real‐world problems from different areas as an introduction to the different subjects. It explains the multi‐layer perceptron neural networks and k‐means. The book explores the methodologies for planning and developing pro...
This chapter explores a project that relates to the CRoss‐Industry Standard Process for Data Mining (CRISP‐DM) methodology. The data used, entitled “Polish companies bankruptcy data”, can be obtained from the UCI machine learning repository, which is easily accessible on the web. The chapter presents a cheat sheet on predictive algorithms. Investors, banks and...
This chapter describes simple multivariate methods from the three data analysis approaches – frequency, visualization and statistical. The multivariate frequency values can be computed independently for each attribute. The chapter explores how multivariate data can be visually represented in different ways and the main benefits of each of these alt...