Heitor Murilo Gomes

Heitor Murilo Gomes
Victoria University of Wellington · School of Engineering and Computer Science

PhD

About

71
Publications
44,981
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
1,260
Citations
Introduction
My main research area is Machine Learning, specially Evolving Data Streams, Concept Drift, Ensemble methods and Big Data Streams. I contribute to both MOA and StreamDM open data stream mining projects. More information: http://www.heitorgomes.com

Publications

Publications (71)
Article
Full-text available
Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integr...
Article
Full-text available
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests alg...
Article
Full-text available
A large portion of the stream mining studies on classification rely on the availability of true labels immediately after making predictions. This approach is well exemplified by the test-then-train evaluation, where predictions immediately precede true label arrival. However, in many real scenarios, labels arrive with non-negligible latency. This r...
Article
Full-text available
Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., V...
Conference Paper
Full-text available
Ensemble methods are a popular choice for learning from evolving data streams. This popularity is due to (i) the ability to simulate simple, yet, successful ensemble learning strategies, such as bagging and random forests; (ii) the possibility of incorporating drift detection and recovery in conjunction to the ensemble algorithm; (iii) the availabi...
Article
Full-text available
Most research in machine learning for data streams has focused on classification algorithms, whereas regression methods have received a lot less attention. This paper proposes Self-Optimising K-Nearest Leaves (SOKNL), a novel forest-based algorithm for streaming regression problems. Specifically, the Adaptive Random Forest Regression, a state-of-th...
Article
Full-text available
Concept drift detection is a crucial task in data stream evolving environments. Most of state of the art approaches designed to tackle this problem monitor the loss of predictive models. However, this approach falls short in many real-world scenarios, where the true labels are not readily available to compute the loss. In this context, there is inc...
Preprint
Full-text available
Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted mac...
Article
In many real-world domains, data can naturally be represented as networks. This is the case of social networks, bibliographic networks, sensor networks and biological networks. Some dynamism often characterizes these networks as their structure (i.e., nodes and edges) continually evolves. Considering this dynamism is essential for analyzing these n...
Article
Full-text available
Unlabelled data appear in many domains and are particularly relevant to streaming applications, where even though data is abundant, labelled data is rare. To address the learning problems associated with such data, one can ignore the unlabelled data and focus only on the labelled data (supervised learning); use the labelled data and attempt to leve...
Article
Full-text available
Decision tree ensembles are widely used in practice. In this work, we study in ensemble settings the effectiveness of replacing the split strategy for the state-of-the-art online tree learner, Hoeffding Tree, with a rigorous but more eager splitting strategy that we had previously published as Hoeffding AnyTime Tree. Hoeffding AnyTime Tree (HATT),...
Preprint
Full-text available
In recent years, the Edge Computing (EC) paradigm has emerged as an enabling factor for developing technologies like the Internet of Things (IoT) and 5G networks, bridging the gap between Cloud Computing services and end-users, supporting low latency, mobility, and location awareness to delay-sensitive applications. Most solutions in EC employ mach...
Article
Full-text available
Understanding how machine learning algorithms can be used for stream processing on edge devices remains an important challenge. Such ML algorithms can be represented as operators and dynamically adapted based on the resources on which they are hosted. Deploying machine learning algorithms on edge resources often focuses on carrying out inference on...
Preprint
Full-text available
Often, machine learning applications have to cope with dynamic environments where data are collected in the form of continuous data streams with potentially infinite length and transient behavior. Compared to traditional (batch) data mining, stream processing algorithms have additional requirements regarding computational resources and adaptability...
Article
Often, machine learning applications have to cope with dynamic environments where data are collected in the form of continuous data streams with potentially infinite length and transient behavior. Compared to traditional (batch) data mining, stream processing algorithms have additional requirements regarding computational resources and adaptability...
Article
Full-text available
Ensemble methods represent an effective way to solve supervised learning problems. Such methods are prevalent for learning from evolving data streams. One of the main reasons for such popularity is the possibility of incorporating concept drift detection and recovery strategies in conjunction with the ensemble algorithm. On top of that, successful...
Preprint
Full-text available
Unlabelled data appear in many domains and are particularly relevant to streaming applications, where even though data is abundant, labelled data is rare. To address the learning problems associated with such data, one can ignore the unlabelled data and focus only on the labelled data (supervised learning); use the labelled data and attempt to leve...
Article
Full-text available
The significant growth of interconnected Internet‐of‐Things (IoT) devices, the use of social networks, along with the evolution of technology in different domains, lead to a rise in the volume of data generated continuously from multiple systems. Valuable information can be derived from these evolving data streams by applying machine learning. In p...
Preprint
Full-text available
Concept drift detection is a crucial task in data stream evolving environments. Most of state of the art approaches designed to tackle this problem monitor the loss of predictive models. However, this approach falls short in many real-world scenarios, where the true labels are not readily available to compute the loss. In this context, there is inc...
Conference Paper
Full-text available
Machine Learning techniques have been employed in virtually all domains in the past few years. New applications demand the ability to cope with dynamic environments like data streams with transient behavior. Such environments present new requirements like incrementally process incoming data instances in a single pass, under both memory and time con...
Preprint
Full-text available
River is a machine learning library for dynamic data streams and continual learning. It provides multiple state-of-the-art learning methods, data generators/transformers, performance metrics and evaluators for different stream learning problems. It is the result from the merger of the two most popular packages for stream learning in Python: Creme a...
Conference Paper
Ensemble-based methods are one of the most often used methods in the classification task that have been adapted to the stream setting because of their high learning performance achievement. For instance, Adaptive Random Forests (ARF) is a recent ensemble method for evolving data streams that proved to be of a good predictive performance but, as all...
Preprint
Machine Learning (ML) has been widely applied to cybersecurity, and is currently considered state-of-the-art for solving many of the field's open issues. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas (at least not in the same way). One of these cha...
Conference Paper
Full-text available
Concept drift detection is a crucial task in data stream evolving environments. Most of the state of the art approaches designed to tackle this problem monitor the loss of predictive models. Accordingly, an alarm is launched when the loss increases significantly, which triggers some adaptation mechanism (e.g. retrain the model). However, this modus...
Preprint
Full-text available
We study the effectiveness of replacing the split strategy for the state-of-the-art online tree learner, Hoeffding Tree, with a rigorous but more eager splitting strategy. Our method, Hoeffding AnyTime Tree (HATT), uses the Hoeffding Test to determine whether the current best candidate split is superior to the current split, with the possibility of...
Chapter
Concept drift detection is a crucial task in data stream evolving environments. Most of the state of the art approaches designed to tackle this problem monitor the loss of predictive models. Accordingly, an alarm is launched when the loss increases significantly, which triggers some adaptation mechanism (e.g. retrain the model). However, this modus...
Chapter
Full-text available
A dynamic attributed graph is a graph that changes over time and where each vertex is described using multiple continuous attributes. Such graphs are found in numerous domains, e.g., social network analysis. Several studies have been done on discovering patterns in dynamic attributed graphs to reveal how attribute(s) change over time. However, many...
Conference Paper
Full-text available
For many streaming classification tasks, the ground truth labels become available with a non-negligible latency. Given this delayed labelling setting, after the instance data arrives and before its true label is known, the online classifier model may change. Hence, the initial prediction can be replaced with additional periodic predictions graduall...
Conference Paper
Full-text available
An ensemble of learners tends to exceed the pre-dictive performance of individual learners. This approach has been explored for both batch and online learning. Ensembles methods applied to data stream classification were thoroughly investigated over the years, while their regression counterparts received less attention in comparison. In this work,...
Conference Paper
Full-text available
Mining high-dimensional data streams poses a fundamental challenge to machine learning as the presence of high numbers of attributes can remarkably degrade any mining task's performance. In the past several years, dimension reduction (DR) approaches have been successfully applied for different purposes (e.g., visualization). Due to their high-compu...
Conference Paper
Full-text available
Assigning scores to individual features is a popular method for estimating the relevance of features in supervised learning. An accurate feature score estimation provides essential insights in sensitive domains, which is decisive to explain how features influence a given decision, contributing to the inter-pretability of the model. Learning from st...
Chapter
Ensemble classifiers are a promising approach for data stream classification. Though, diversity influences the performance of ensemble classifiers, current studies do not take advantage of relations between component classifiers to improve their performance. This paper addresses this issue by proposing a new kind of ensemble learner for data stream...
Conference Paper
Full-text available
The use of Machine Learning (ML) techniques for malware detection has been a trend in the last two decades. More recently, researchers started to investigate adversarial approaches to bypass these ML-based malware detectors. Adversarial attacks became so popular that a large Internet company has launched a public challenge to encourage researchers...
Article
Full-text available
The Publisher regrets an error in the spelling of the family name of the sixth author. The correct spelling is Bernhard Pfahringer, as it appears in the author list above.
Conference Paper
Trust mechanisms are considered the logical protection of software systems, preventing malicious people from taking advantage or cheating others. Although these concepts are widely used, most applications in this field do not consider affective aspects to aid in trust computation. Researchers of Psychology, Neurology, Anthropology, and Computer Sci...
Article
Full-text available
Feature selection targets the identification of which features of a dataset are relevant to the learning task. It is also widely known and used to improve computation times, reduce computation requirements, and to decrease the impact of the curse of dimensionality and enhancing the generalization rates of classifiers. In data streams, classifiers s...
Chapter
Full-text available
Extracting useful patterns from data is a challenging task that has been extensively investigated by both machine learning researchers and practitioners for many decades. This task becomes even more problematic when data is presented as a potentially unbounded sequence, the so-called data streams. Albeit most of the research on data stream mining f...
Article
Full-text available
Learning from ephemeral data streams has garnered the interest of both researchers and practitioners towards adaptive learning techniques. Despite the convincing results obtained thus far, most of the current research still overlooks that the relevance of features may change throughout the learning process. Scenarios where features become - or ceas...
Conference Paper
Full-text available
The financial market is one of the major consumers of data mining techniques, and the main reason is their efficiency to analyze complex data. One important trait shared between most financial applications is class imbalance. Since traditional classification methods assume nearly balanced classes and equal misclassification costs, they usually fail...
Conference Paper
Full-text available
Data stream mining is a hot topic in the machine learning community that tackles the problem of learning and updating predictive models as new data becomes available over time. Even though several new methods are proposed every year, most focus on the classification task and overlook the regression task. In this paper, we propose an adaptation to t...
Article
The fundamental role for poultry farmers to be successful in their activities is to precisely increase, decrease, or maintain, in a short time span, factors that determine poultry growth, such as humidity, temperature, amount of feed ration, ventilation, and others. Although there are modern automatic control technologies supporting these aspects,...
Conference Paper
Full-text available
Peer-to-peer (P2P) lending is a global trend of financial markets that allow individuals to obtain and concede loans without having financial institutions as a strong proxy. As many real-world applications, P2P lending presents an imbalanced characteristic, where the number of creditworthy loan requests is much larger than the number of non-creditw...
Conference Paper
Full-text available
Extracting useful knowledge from data streams is problematic, mainly due to changes in their data distribution, a phenomenon named concept drift. Recently, studies have shown that most of existing algorithms for learning from data streams do not encompass techniques for a specific kind of drift: feature drifts. Feature drifts occur when features be...
Conference Paper
Full-text available
The ever increasing data generation confronts both practitioners and researchers on handling massive and sequentially generated amounts of information, the so-called data streams. In this context, a lot of effort has been put on the extraction of useful patterns from streaming scenarios. Learning from data streams embeds a variety of problems, and...
Conference Paper
Full-text available
The ubiquity of data streams has been encouraging the development of new incremental and adaptive learning algorithms. Data stream learners must be fast, memory-bounded, but mainly, tailored to adapt to possible changes in the data distribution, a phenomenon named concept drift. Recently, several works have shown the impact of a so far nearly negle...
Article
Full-text available
Data stream mining is a fast growing research topic due to the ubiquity of data in several real-world problems. Given their ephemeral nature, data stream sources are expected to undergo changes in data distribution, a phenomenon called concept drift. This paper focuses on one specific type of drift that has not yet been thoroughly studied, namely f...
Article
Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally as data arrives. On top of that, due to the inherent evolving nature of data streams, it is expected that algorithms undergo both concept drifts and evolutions, which must be taken into account by the clust...
Conference Paper
Data stream classification has grown in importance in recent years due to the large amount of data rapidly generated in a multitude of domains. Many algorithms were proposed to deal with the problems associated with data stream classification (e.g. concept drifts) with special attention to ensemble methods. Ensembles are often preferred due to thei...
Conference Paper
Full-text available
Mining data streams is of the utmost importance due to its appearance in many real-world situations, such as: sensor networks, stock market analysis and computer networks intrusion detection systems. Data streams are, by definition, potentially unbounded sequences of data that arrive intermittently at rapid rates. Extracting useful knowledge from d...
Conference Paper
Full-text available
The increased use of data mining algorithms reflects the need for automatic extraction of knowledge from large volumes of data. This work presents a temporal data mining algorithm that discovers frequent Association Rules from timestamped data. These rules are named Cause-Effect Rules, each represented by a multiset of unordered events (Cause) foll...
Conference Paper
Full-text available
Data stream mining is an active area of research that poses challenging research problems. In the latter years, a variety of data stream clustering algorithms have been proposed to perform unsupervised learning using a two-step framework. Additionally, dealing with non-stationary, unbounded data streams requires the development of algorithms capabl...
Conference Paper
Full-text available
Learning from data streams requires efficient algorithms capable of deriving a model accordingly to the arrival of new instances. Data streams are by definition unbounded sequences of data that are possibly non stationary, i.e. they may undergo changes in data distribution, phenomenon named concept drift. Concept drifts force streaming learning alg...
Conference Paper
Full-text available
Data Stream Clustering is an active area of research which requires efficient algorithms capable of finding and updating clusters incrementally. On top of that, due to the inherent evolving nature of data streams, it is expected that these algorithms manage to quickly adapt to both concept drifts and the appearance and disappearance of clusters. Ne...