Heitor Murilo Gomes
Victoria University of Wellington · School of Engineering and Computer Science
PhD
About
Publications: 100
Reads: 61,010
Citations: 2,724
Introduction
My main research area is Machine Learning, especially Evolving Data Streams, Concept Drift, Ensemble Methods and Big Data Streams.
I contribute to both the MOA and StreamDM open data stream mining projects.
More information: http://www.heitorgomes.com
Publications (100)
Ensemble-based methods are among the most widely used techniques for data stream classification. Their popularity is attributable to their good performance in comparison to strong single learners while being relatively easy to deploy in real-world applications. Ensemble algorithms are especially useful for data stream learning as they can be integr...
Random forests is currently one of the most used machine learning algorithms in the non-streaming (batch) setting. This preference is attributable to its high learning performance and low demands with respect to input preparation and hyper-parameter tuning. However, in the challenging context of evolving data streams, there is no random forests alg...
A large portion of the stream mining studies on classification rely on the availability of true labels immediately after making predictions. This approach is well exemplified by the test-then-train evaluation, where predictions immediately precede true label arrival. However, in many real scenarios, labels arrive with non-negligible latency. This r...
Incremental learning, online learning, and data stream learning are terms commonly associated with learning algorithms that update their models given a continuous influx of data without performing multiple passes over data. Several works have been devoted to this area, either directly or indirectly as characteristics of big data processing, i.e., V...
Ensemble methods are a popular choice for learning from evolving data streams. This popularity is due to (i) the ability to simulate simple yet successful ensemble learning strategies, such as bagging and random forests; (ii) the possibility of incorporating drift detection and recovery in conjunction with the ensemble algorithm; (iii) the availabi...
This paper introduces a group of novel datasets representing real-time time-series and streaming data of energy prices in New Zealand, sourced from the Electricity Market Information (EMI) website maintained by the New Zealand government. The datasets are intended to address the scarcity of proper datasets for streaming regression learning tasks. W...
In an era where machine learning permeates every facet of human existence, and data evolves incessantly, the application of machine learning models transcends mere data processing. It involves navigating constant changes exemplified by the phenomenon of concept drift, which often affects model performance. These drifts can be recurrent due to the c...
New Zealand's unique ecosystems face increasing threats from climate change, impacting biodiversity and posing challenges to safety, livelihoods, and well-being. To tackle these complex issues, advanced data science and artificial intelligence techniques can provide unique solutions. Currently, in its fourth year of a seven-year program, TAIAO focu...
Gradient Boosting is a widely used machine learning technique that has proven highly effective in batch learning. However, its effectiveness in stream learning contexts lags behind bagging-based ensemble methods, which currently dominate the field. One reason for this discrepancy is the challenge of adapting the booster to a new concept following a c...
In IoT environments, applications generate continuous non-stationary data streams with inherent problems of concept drift and class imbalance, which cause classifier performance degradation. The imbalanced data affects the classifier during concept detection and concept adaptation. In general, for concept detection, a separate mechanism is added in pa...
Continual learning aims to create artificial neural networks capable of accumulating knowledge and skills through incremental training on a sequence of tasks. The main challenge of continual learning is catastrophic interference, wherein new knowledge overrides or interferes with past knowledge, leading to forgetting. An associated issue is the pro...
Machine Learning (ML) has been widely applied to cybersecurity and is considered state-of-the-art for solving many of the open issues in that field. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas. One of these challenges is concept drift, which...
Stream Learning (SL) attempts to learn from a data stream efficiently. A data stream learning algorithm should adapt to input data distribution shifts without sacrificing accuracy. These distribution shifts are known as "concept drifts" in the literature. SL provides many supervised, semi-supervised, and unsupervised methods for detecting and adjus...
Purpose
This study aims to investigate whether Brazilian companies have increased their reporting on biodiversity within the past decade and whether reporting practices are linked to the government's stance on environmental protection, media coverage and industry biodiversity risk.
Design/methodology/approach
Using content analysis and ordinary le...
Mining data streams is one of the main areas of study in machine learning due to its application in many knowledge areas. One of the major challenges in mining data streams is concept drift, which requires the learner to discard the current concept and adapt to a new one. Ensemble-based drift detection algorithms have been used successfully to the cl...
Every application in a smart city environment, such as the smart grid, health monitoring, security, and surveillance, generates non-stationary data streams. Because of this, the statistical properties of the data change over time, leading to class imbalance and concept drift issues. Both these issues cause model performance degradation. Most of the curr...
Today’s malware variants are growing at an unprecedented rate. To avoid detection by existing antivirus engines, attackers have been increasing the complexity of packers, layers of obfuscation, and encryption to obstruct the process of reverse engineering. This paper presents an automated method using static analysis for extracting opcode sequences...
The performance of machine learning models diminishes while predicting the Remaining Useful Life (RUL) of the equipment or fault prediction due to the issue of concept drift. This issue is aggravated when the problem setting comprises multi-class imbalanced data. The existing drift detection methods are designed to detect certain drifts in specific...
Continual Learning (CL) poses a significant challenge to Neural Networks (NNs), where the data distribution changes from one task to another. In Online Domain Incremental Continual Learning (OD-ICL), this distribution change happens in the input space without affecting the label distribution. In order to adapt to such changes, the model being traine...
Continual Learning (CL) problems pose significant challenges for Neural Networks (NNs). Online Domain Incremental Continual Learning (ODI-CL) refers to situations where the data distribution may change from one task to another. These changes can severely affect the learned model, focusing too much on previous data and failing to properly learn and r...
Behavior-based machine learning plays a vital role in malware classification, as it potentially overcomes the limitations of signature-based methods. This paper explores the use of dynamic call sequences as extracted by the open source Noriben tool, which employs dynamic analysis in a virtualized environment. Call sequences of a length of up to 500...
Most research in machine learning for data streams has focused on classification algorithms, whereas regression methods have received a lot less attention. This paper proposes Self-Optimising K-Nearest Leaves (SOKNL), a novel forest-based algorithm for streaming regression problems. Specifically, the Adaptive Random Forest Regression, a state-of-th...
Malware is a major threat to computer systems and imposes many challenges to cyber security. Targeted threats, such as ransomware, cause millions of dollars in losses every year. The constant increase of malware infections has been motivating popular antiviruses (AVs) to develop dedicated detection strategies, which include meticulously crafted mac...
Source code is available in the latest version of MOA:
https://github.com/Waikato/moa
Please look for the CAND module (src/main/java/moa/classifiers/deeplearning/CAND.java).
Understanding how machine learning algorithms can be used for stream processing on edge devices remains an important challenge. Such ML algorithms can be represented as operators and dynamically adapted based on the resources on which they are hosted. Deploying machine learning algorithms on edge resources often focuses on carrying out inference on...
Concept drift detection is a crucial task in evolving data stream environments. Most state-of-the-art approaches designed to tackle this problem monitor the loss of predictive models. However, this approach falls short in many real-world scenarios, where the true labels are not readily available to compute the loss. In this context, there is inc...
In many real-world domains, data can naturally be represented as networks. This is the case of social networks, bibliographic networks, sensor networks and biological networks. Some dynamism often characterizes these networks as their structure (i.e., nodes and edges) continually evolves. Considering this dynamism is essential for analyzing these n...
Unlabelled data appear in many domains and are particularly relevant to streaming applications, where even though data is abundant, labelled data is rare. To address the learning problems associated with such data, one can ignore the unlabelled data and focus only on the labelled data (supervised learning); use the labelled data and attempt to leve...
Decision tree ensembles are widely used in practice. In this work, we study in ensemble settings the effectiveness of replacing the split strategy for the state-of-the-art online tree learner, Hoeffding Tree, with a rigorous but more eager splitting strategy that we had previously published as Hoeffding AnyTime Tree. Hoeffding AnyTime Tree (HATT),...
In recent years, the Edge Computing (EC) paradigm has emerged as an enabling factor for developing technologies like the Internet of Things (IoT) and 5G networks, bridging the gap between Cloud Computing services and end-users, supporting low latency, mobility, and location awareness to delay-sensitive applications. Most solutions in EC employ mach...
Unequal data distribution among different classes usually causes a class imbalance problem. Due to the class imbalance, the classification models become biased toward the majority class and misclassify the minority class. The class imbalance issue becomes more complex when it occurs in multi-class data. The most common method to handle the class imbalan...
In recent years, the Edge Computing (EC) paradigm has emerged as an enabling factor for developing technologies like the Internet of Things (IoT) and 5G networks, bridging the gap between Cloud Computing services and end-users, supporting low latency, mobility, and location awareness to delay-sensitive applications. An increasing number of solution...
Often, machine learning applications have to cope with dynamic environments where data are collected in the form of continuous data streams with potentially infinite length and transient behavior. Compared to traditional (batch) data mining, stream processing algorithms have additional requirements regarding computational resources and adaptability...
Ensemble methods represent an effective way to solve supervised learning problems. Such methods are prevalent for learning from evolving data streams. One of the main reasons for such popularity is the possibility of incorporating concept drift detection and recovery strategies in conjunction with the ensemble algorithm. On top of that, successful...
The significant growth of interconnected Internet‐of‐Things (IoT) devices, the use of social networks, along with the evolution of technology in different domains, lead to a rise in the volume of data generated continuously from multiple systems. Valuable information can be derived from these evolving data streams by applying machine learning. In p...
Machine Learning techniques have been employed in virtually all domains in the past few years. New applications demand the ability to cope with dynamic environments like data streams with transient behavior. Such environments present new requirements, such as incrementally processing incoming data instances in a single pass, under both memory and time con...
River is a machine learning library for dynamic data streams and continual learning. It provides multiple state-of-the-art learning methods, data generators/transformers, performance metrics and evaluators for different stream learning problems. It is the result of the merger of the two most popular packages for stream learning in Python: Creme a...
Ensemble-based methods are among the most frequently used methods for classification and have been adapted to the stream setting because of their high learning performance. For instance, Adaptive Random Forests (ARF) is a recent ensemble method for evolving data streams that has shown good predictive performance but, as all...
Machine Learning (ML) has been widely applied to cybersecurity, and is currently considered state-of-the-art for solving many of the field's open issues. However, it is very difficult to evaluate how good the produced solutions are, since the challenges faced in security may not appear in other areas (at least not in the same way). One of these cha...
Concept drift detection is a crucial task in evolving data stream environments. Most state-of-the-art approaches designed to tackle this problem monitor the loss of predictive models. Accordingly, an alarm is raised when the loss increases significantly, which triggers some adaptation mechanism (e.g. retraining the model). However, this modus...
We study the effectiveness of replacing the split strategy for the state-of-the-art online tree learner, Hoeffding Tree, with a rigorous but more eager splitting strategy. Our method, Hoeffding AnyTime Tree (HATT), uses the Hoeffding Test to determine whether the current best candidate split is superior to the current split, with the possibility of...
A dynamic attributed graph is a graph that changes over time and where each vertex is described using multiple continuous attributes. Such graphs are found in numerous domains, e.g., social network analysis. Several studies have been done on discovering patterns in dynamic attributed graphs to reveal how attribute(s) change over time. However, many...
For many streaming classification tasks, the ground truth labels become available with a non-negligible latency. Given this delayed labelling setting, after the instance data arrives and before its true label is known, the online classifier model may change. Hence, the initial prediction can be replaced with additional periodic predictions graduall...
An ensemble of learners tends to exceed the predictive performance of individual learners. This approach has been explored for both batch and online learning. Ensemble methods applied to data stream classification were thoroughly investigated over the years, while their regression counterparts received less attention in comparison. In this work,...
Mining high-dimensional data streams poses a fundamental challenge to machine learning as the presence of high numbers of attributes can remarkably degrade any mining task's performance. In the past several years, dimension reduction (DR) approaches have been successfully applied for different purposes (e.g., visualization). Due to their high-compu...
Assigning scores to individual features is a popular method for estimating the relevance of features in supervised learning. An accurate feature score estimation provides essential insights in sensitive domains, which is decisive for explaining how features influence a given decision, contributing to the interpretability of the model. Learning from st...
Ensemble classifiers are a promising approach for data stream classification. Although diversity influences the performance of ensemble classifiers, current studies do not take advantage of relations between component classifiers to improve their performance. This paper addresses this issue by proposing a new kind of ensemble learner for data stream...
The use of Machine Learning (ML) techniques for malware detection has been a trend in the last two decades. More recently, researchers started to investigate adversarial approaches to bypass these ML-based malware detectors. Adversarial attacks became so popular that a large Internet company has launched a public challenge to encourage researchers...
The Publisher regrets an error in the spelling of the family name of the sixth author. The correct spelling is Bernhard Pfahringer, as it appears in the author list above.
Trust mechanisms are considered the logical protection of software systems, preventing malicious people from taking advantage or cheating others. Although these concepts are widely used, most applications in this field do not consider affective aspects to aid in trust computation. Researchers of Psychology, Neurology, Anthropology, and Computer Sci...