Eibe Frank's research while affiliated with The University of Waikato and other places

Publications (193)

Article
Proper management of the earth's natural resources is imperative to combat further degradation of the natural environment. However, the environmental datasets necessary for informed resource planning and conservation can be costly to collect and annotate. Consequently, there is a lack of publicly available datasets, particularly annotated image dat...
Preprint
Full-text available
Cross-domain few-shot meta-learning (CDFSML) addresses learning problems where knowledge needs to be transferred from several source domains into an instance-scarce target domain with an explicitly different input distribution. Recently published CDFSML methods generally construct a "universal model" that combines knowledge of multiple source domai...
Article
Full-text available
SHapley Additive exPlanation (SHAP) values (Lundberg & Lee, 2017) provide a game theoretic interpretation of the predictions of machine learning models based on Shapley values (Shapley, 1953). While exact calculation of SHAP values is computationally intractable in general, a recursive polynomial-time algorithm called TreeShap (Lundberg et al., 202...
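The Shapley attribution idea underlying SHAP can be illustrated with a brute-force sketch in plain Python. This is not the TreeShap algorithm from the paper — it is the exponential-time definition, with an assumed value function that replaces "absent" features by baseline values:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating all feature subsets.
    v(S) is the model output when features in S take their values
    from x and the rest take values from the baseline (an assumed,
    simplified value function -- not SHAP's conditional expectation)."""
    n = len(x)

    def v(S):
        z = [x[j] if j in S else baseline[j] for j in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                S = set(subset)
                # Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += weight * (v(S | {i}) - v(S))
        phi.append(total)
    return phi

# For a linear model, Shapley values reduce to w_i * (x_i - baseline_i).
w = [2.0, -1.0, 0.5]
f = lambda z: sum(wi * zi for wi, zi in zip(w, z))
phi = shapley_values(f, [1.0, 2.0, 4.0], [0.0, 0.0, 0.0])
```

The attributions also satisfy the efficiency property: they sum to `f(x) - f(baseline)`. TreeShap achieves the same result in polynomial time by exploiting the structure of decision trees.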
Article
Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the method of Fisher-Yates, are not well suited to implementation on GPUs due to inherent sequential dependencies, and existing parallel shuffling algorithms are unsuitable for GPU architectures because they incur a large number of read/write operations to high late...
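The sequential dependency mentioned above is easy to see in the Fisher-Yates shuffle itself: each swap depends on the array state left by the previous swap, which is what makes a direct GPU port awkward. A minimal reference implementation:

```python
import random

def fisher_yates(seq, rng=random.Random()):
    """In-place Fisher-Yates shuffle: O(n) and uniform over permutations,
    but each swap depends on the previous state of the array, so the
    loop is inherently sequential."""
    a = list(seq)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randrange(i + 1)  # uniform draw from [0, i]
        a[i], a[j] = a[j], a[i]
    return a
```

Parallel alternatives typically trade this data dependency for extra memory traffic (e.g., sort-by-random-key), which is exactly the cost on high-latency GPU memory that the abstract alludes to.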
Chapter
Many explanation methods have been proposed to reveal insights about the internal procedures of black-box models like deep neural networks. Although these methods are able to generate explanations for individual predictions, little research has been conducted to investigate the relationship of model accuracy and explanation quality, or how to use e...
Preprint
Full-text available
Deep neural networks produce state-of-the-art results when trained on a large number of labeled examples but tend to overfit when small amounts of labeled examples are used for training. Creating a large number of labeled examples requires considerable resources, time, and effort. If labeling new data is not feasible, so-called semi-supervised lear...
Preprint
Full-text available
Neural networks have been successfully used as classification models yielding state-of-the-art results when trained on a large number of labeled samples. These models, however, are more difficult to train successfully for semi-supervised problems where small amounts of labeled instances are available along with a large number of unlabeled instances...
Preprint
Full-text available
Self-training is a simple semi-supervised learning approach: Unlabelled examples that attract high-confidence predictions are labelled with their predictions and added to the training set, with this process being repeated multiple times. Recently, self-supervision -- learning without manual supervision by solving an automatically-generated pretext...
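The self-training loop described above can be sketched with any base classifier; here a nearest-centroid learner and a distance-margin confidence score stand in as illustrative assumptions (the paper pairs self-training with self-supervision, which this sketch omits):

```python
def centroid_fit(X, y):
    """Nearest-centroid classifier: one mean vector per class."""
    cents = {}
    for label in set(y):
        pts = [x for x, t in zip(X, y) if t == label]
        cents[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return cents

def centroid_predict(cents, x):
    """Return (label, confidence); confidence is the gap between the
    two smallest squared distances to a class centroid."""
    d = sorted((sum((a - b) ** 2 for a, b in zip(x, c)), lab)
               for lab, c in cents.items())
    conf = d[1][0] - d[0][0] if len(d) > 1 else float("inf")
    return d[0][1], conf

def self_train(X_lab, y_lab, X_unlab, threshold=1.0, rounds=5):
    """Self-training: pseudo-label high-confidence unlabelled examples,
    add them to the training set, and retrain, repeating until no
    example clears the confidence threshold."""
    X, y = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(rounds):
        model = centroid_fit(X, y)
        keep, added = [], 0
        for x in pool:
            label, conf = centroid_predict(model, x)
            if conf >= threshold:
                X.append(x)
                y.append(label)
                added += 1
            else:
                keep.append(x)
        pool = keep
        if added == 0:
            break
    return centroid_fit(X, y)

model = self_train([[0.0, 0.0], [4.0, 4.0]], ["a", "b"],
                   [[0.5, 0.5], [3.5, 3.5]])
```

The threshold controls the usual self-training trade-off: too low and early mistakes are reinforced, too high and no unlabelled data is ever used.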
Preprint
Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the method of Fisher-Yates, are not well suited to implementation on GPUs due to inherent sequential dependencies. Moreover, existing parallel shuffling algorithms show unsatisfactory performance on GPU architectures because they incur a large number of read/write o...
Preprint
Full-text available
Game-theoretic attribution techniques based on Shapley values are used extensively to interpret black-box machine learning models, but their exact calculation is generally NP-hard, requiring approximation methods for non-trivial models. As the computation of Shapley values can be expressed as a summation over a set of permutations, a common approac...
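The permutation-sum formulation mentioned above leads directly to the standard Monte Carlo estimator: sample permutations uniformly and average each feature's marginal contribution. A plain-Python sketch (the paper's own variance-reduction scheme, e.g. antithetic or stratified sampling, is not shown):

```python
import random

def shapley_mc(f, x, baseline, n_samples=2000, rng=random.Random(0)):
    """Monte Carlo Shapley estimate: average each feature's marginal
    contribution over uniformly sampled feature orderings, where absent
    features take baseline values (an assumed value function)."""
    n = len(x)
    phi = [0.0] * n
    for _ in range(n_samples):
        perm = list(range(n))
        rng.shuffle(perm)
        z = list(baseline)
        prev = f(z)
        for i in perm:            # reveal features one by one
            z[i] = x[i]
            cur = f(z)
            phi[i] += cur - prev  # marginal contribution of feature i
            prev = cur
    return [p / n_samples for p in phi]

# For an additive model every permutation yields the same contributions,
# so the estimate is exact regardless of the sample size.
est = shapley_mc(lambda z: z[0] + 2 * z[1] + 3 * z[2],
                 [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], n_samples=200)
```

For non-additive models the estimator is unbiased but noisy, which is precisely what motivates the variance-reduction techniques the abstract refers to.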
Article
We empirically evaluate lightweight moment estimators for the single-pass quantile approximation problem, including maximum entropy methods and orthogonal series with Fourier, Cosine, Legendre, Chebyshev and Hermite basis functions. We show how to apply stable summation formulas to offset numerical precision issues for higher-order moments, leading...
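The stable-summation idea referenced above can be illustrated with Kahan compensated summation, which tracks the rounding error of each addition; this is one of several possible stable formulas, shown here as a generic sketch rather than the paper's exact method:

```python
def kahan_sum(values):
    """Kahan compensated summation: carries a running correction term
    so that low-order bits lost in each addition are fed back in."""
    total, comp = 0.0, 0.0
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y  # recovered rounding error
        total = t
    return total

def raw_moment(xs, k):
    """k-th raw sample moment computed with compensated summation.
    Higher-order moments magnify rounding error, which is where
    compensation pays off."""
    return kahan_sum(x ** k for x in xs) / len(xs)
```

For single-pass quantile sketches built from moments, such precision control matters because the moment-matching step amplifies any error accumulated in the raw sums.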
Article
The family of methods collectively known as classifier chains has become a popular approach to multi-label learning problems. This approach involves chaining together off-the-shelf binary classifiers in a directed structure, such that individual label predictions become features for other classifiers. Such methods have proved flexible and effective...
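The chaining mechanism described above — each label's prediction becoming an input feature for later labels — can be sketched with any base learner; a 1-nearest-neighbour classifier is used here purely as an illustrative stand-in:

```python
class NN1:
    """1-nearest-neighbour base learner (stand-in for any binary classifier)."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, x):
        i = min(range(len(self.X)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(self.X[k], x)))
        return self.y[i]

def fit_chain(X, Y, base=NN1):
    """Classifier chain: the j-th model is trained on the original
    features plus the true values of labels 0..j-1."""
    models = []
    for j in range(len(Y[0])):
        Xj = [list(x) + [row[k] for k in range(j)] for x, row in zip(X, Y)]
        models.append(base().fit(Xj, [row[j] for row in Y]))
    return models

def predict_chain(models, x):
    """At test time, earlier *predictions* are appended as extra features."""
    preds = []
    for m in models:
        preds.append(m.predict(list(x) + preds))
    return preds

models = fit_chain([[0.0], [1.0]], [[0, 0], [1, 1]])
```

Note the train/test asymmetry: training uses true label values while prediction must use earlier predictions, so errors can propagate down the chain — one of the issues the classifier-chains literature analyses.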
Article
Full-text available
We investigate the effect of explicitly enforcing the Lipschitz continuity of neural networks. Our main hypothesis is that constraining the Lipschitz constant of a network will have a regularising effect. To this end, we provide a simple technique for computing the Lipschitz constant of a feed-forward neural network composed of commonly used layer...
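A standard way to compute such a bound — shown here as a generic sketch, not necessarily the paper's exact technique — is to multiply the operator norms of the weight matrices, since 1-Lipschitz activations like ReLU do not increase the constant. Using the infinity norm (max absolute row sum) keeps the arithmetic trivial:

```python
def op_norm_inf(W):
    """Operator norm of a matrix w.r.t. the infinity norm:
    the maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in W)

def lipschitz_upper_bound(weights):
    """Upper bound on the Lipschitz constant of a feed-forward net
    x -> W_L(...sigma(W_1 x)...), assuming 1-Lipschitz activations:
    the product of the layers' operator norms."""
    bound = 1.0
    for W in weights:
        bound *= op_norm_inf(W)
    return bound

# Two layers: norms 7 and 1, so the bound is 7.
b = lipschitz_upper_bound([[[1.0, 2.0], [3.0, 4.0]],
                           [[0.5, 0.5]]])
```

Constraining training so that this product stays below a target value is one concrete way to impose the regularisation effect hypothesised above; the bound is generally loose, since it ignores how activations interact across layers.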
Chapter
We present an empirical evaluation of machine learning algorithms in cross-domain few-shot learning based on a fixed pre-trained feature extractor. Experiments were performed in five target domains (CropDisease, EuroSAT, Food101, ISIC and ChestX) and using two feature extractors: a ResNet10 model trained on a subset of ImageNet known as miniImageNe...
Chapter
Deep neural networks produce state-of-the-art results when trained on a large number of labeled examples but tend to overfit when small amounts of labeled examples are used for training. Creating a large number of labeled examples requires considerable resources, time, and effort. If labeling new data is not feasible, so-called semi-supervised lear...
Preprint
SHAP (SHapley Additive exPlanation) values provide a game theoretic interpretation of the predictions of machine learning models based on Shapley values. While SHAP values are intractable in general, a recursive polynomial time algorithm specialised for decision tree models is available, named TreeShap. Despite its polynomial time complexity, TreeS...
Preprint
Full-text available
Boosting is an ensemble method that combines base models in a sequential manner to achieve high predictive accuracy. A popular learning algorithm based on this ensemble method is eXtreme Gradient Boosting (XGB). We present an adaptation of XGB for classification of evolving data streams. In this setting, new data arrives over time and the relations...
Preprint
Full-text available
Automatic source code analysis in key areas of software engineering, such as code security, can benefit from Machine Learning (ML). However, many standard ML approaches require a numeric representation of data and cannot be applied directly to source code. Thus, to enable ML, we need to embed source code into numeric feature vectors while maintaini...
Preprint
Full-text available
The family of methods collectively known as classifier chains has become a popular approach to multi-label learning problems. This approach involves linking together off-the-shelf binary classifiers in a chain structure, such that class label predictions become features for other classifiers. Such methods have proved flexible and effective and have...
Chapter
Neural networks have been successfully used as classification models yielding state-of-the-art results when trained on a large number of labeled samples. These models, however, are more difficult to train successfully for semi-supervised problems where small amounts of labeled instances are available along with a large number of unlabeled instances...
Article
Full-text available
AffectiveTweets is a set of programs for analyzing emotion and sentiment of social media messages such as tweets. It is implemented as a package for the Weka machine learning workbench and provides methods for calculating state-of-the-art affect analysis features from tweets that can be fed into machine learning algorithms implemented in Weka. It a...
Article
Full-text available
Deep learning is a branch of machine learning that generates multi-layered representations of data, commonly using artificial neural networks, and has improved the state-of-the-art in various machine learning tasks (e.g., image classification, object detection, speech recognition, and document classification). However, most popular deep learning fr...
Chapter
A system of nested dichotomies (NDs) is a method of decomposing a multiclass problem into a collection of binary problems. Such a system recursively applies binary splits to divide the set of classes into two subsets, and trains a binary classifier for each split. Many methods have been proposed to perform this split, each with various advantages a...
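The recursive class-splitting structure described above can be sketched by building the dichotomy tree alone (a random split policy is assumed here; the actual classifier trained at each internal node is omitted):

```python
import random

def build_nd(classes, rng=random.Random(0)):
    """Recursively split the class set into two non-empty random
    subsets, yielding a binary tree over class labels; a real system
    of nested dichotomies trains one binary classifier per split."""
    classes = list(classes)
    if len(classes) == 1:
        return classes[0]
    rng.shuffle(classes)
    k = rng.randrange(1, len(classes))  # split point, both sides non-empty
    return (build_nd(classes[:k], rng), build_nd(classes[k:], rng))

def leaves(node):
    """Collect the class labels at the leaves of a dichotomy tree."""
    if not isinstance(node, tuple):
        return {node}
    return leaves(node[0]) | leaves(node[1])

tree = build_nd(["a", "b", "c", "d"])
```

Prediction then descends the tree, at each internal node querying that split's binary classifier to choose a branch, until a single class remains — which is why the choice of split policy studied in these papers matters for accuracy.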
Chapter
Nested dichotomies (NDs) are used as a method of transforming a multiclass classification problem into a series of binary problems. A tree structure is induced that recursively splits the set of classes into subsets, and a binary classification model learns to discriminate between the two subsets of classes at each node. In this paper, we demonstra...
Preprint
Full-text available
We present an online algorithm that induces decision trees using gradient information as the source of supervision. In contrast to previous approaches to gradient-based tree learning, we do not require soft splits or construction of a new tree for every update. In experiments, our method performs comparably to standard incremental classification tr...
Chapter
Effective regularisation of neural networks is essential to combat overfitting due to the large number of parameters involved. We present an empirical analogue to the Lipschitz constant of a feed-forward neural network, which we refer to as the maximum gain. We hypothesise that constraining the gain of a network will have a regularising effect, sim...
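The "maximum gain" idea — an empirical analogue of the Lipschitz constant — can be illustrated in the simplest possible setting: the largest slope observed between pairs of sample points. This scalar sketch is an assumption-heavy simplification of the paper's network-level quantity:

```python
def max_gain(f, samples):
    """Empirical analogue of a Lipschitz constant for a scalar function:
    the largest ratio |f(x) - f(y)| / |x - y| over all sample pairs."""
    best = 0.0
    for i, x in enumerate(samples):
        for y in samples[i + 1:]:
            if x != y:
                best = max(best, abs(f(x) - f(y)) / abs(x - y))
    return best
```

For a network, the analogous quantity is computed over input pairs (or batches) with vector norms; constraining it during training is the regularisation mechanism hypothesised above.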
Preprint
Full-text available
A system of nested dichotomies is a method of decomposing a multi-class problem into a collection of binary problems. Such a system recursively applies binary splits to divide the set of classes into two subsets, and trains a binary classifier for each split. Many methods have been proposed to perform this split, each with various advantages and di...
Preprint
Full-text available
Nested dichotomies are used as a method of transforming a multiclass classification problem into a series of binary problems. A tree structure is induced that recursively splits the set of classes into subsets, and a binary classification model learns to discriminate between the two subsets of classes at each node. In this paper, we demonstrate tha...
Preprint
Full-text available
Relay attacks are passive man-in-the-middle attacks that aim to extend the physical distance of devices involved in a transaction beyond their operating environment within the restricted time-frame. In the field of smartphones, proposals have been put forward suggesting sensing the natural ambient environment as an effective Proximity and Relay Att...
Preprint
Full-text available
Obtaining accurate and well calibrated probability estimates from classifiers is useful in many applications, for example, when minimising the expected cost of classifications. Existing methods of calibrating probability estimates are applied globally, ignoring the potential for improvements by applying a more fine-grained model. We propose probabi...
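The global calibration baseline the abstract contrasts against can be sketched with Platt scaling: fit a sigmoid map from raw scores to probabilities by minimising log loss. This gradient-descent sketch is the standard global method, not the fine-grained approach the paper proposes:

```python
import math

def platt_fit(scores, labels, lr=0.1, steps=2000):
    """Platt scaling: fit p = sigmoid(a*s + b) to binary labels by
    gradient descent on log loss -- a single *global* calibration map
    applied to every prediction."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, t in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - t) * s / n   # d(log loss)/da
            gb += (p - t) / n       # d(log loss)/db
        a -= lr * ga
        b -= lr * gb
    return lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))

cal = platt_fit([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
```

A more fine-grained scheme, as proposed above, would fit separate maps to different regions of the input or score space instead of one global pair (a, b).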
Preprint
We describe the multi-GPU gradient boosting algorithm implemented in the XGBoost library (https://github.com/dmlc/xgboost). Our algorithm allows fast, scalable training on multi-GPU systems with all of the features of the XGBoost library. We employ data compression techniques to minimise the usage of scarce GPU memory while still allowing highly ef...
Article
Full-text available
We address the problem of estimating discrete, continuous, and conditional joint densities online, i.e., the algorithm is only provided the current example and its current estimate for its update. The family of proposed online density estimators, estimation of densities online (EDO), uses classifier chains to model dependencies among features, wher...
Article
Full-text available
Message-level and word-level polarity classification are two popular tasks in Twitter sentiment analysis. They have been commonly addressed by training supervised models from labelled data. The main limitation of these models is the high cost of data annotation. Transferring existing labels from a related problem domain is one possible solution for...
Article
Full-text available
Effective regularisation of neural networks is essential to combat overfitting due to the large number of parameters involved. We present an empirical analogue to the Lipschitz constant of a feed-forward neural network, which we refer to as the maximum gain. We hypothesise that constraining the gain of a network will have a regularising effect, sim...
Conference Paper
Full-text available
Accounting for misclassification costs is important in many practical applications of machine learning, and cost-sensitive techniques for classification have been studied extensively. Utility-based learning provides a generalization of purely cost-based approaches that considers both costs and benefits, enabling application to domains with complex...
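The decision rule at the heart of utility-based learning is compact enough to state directly: given class probabilities and a utility matrix, pick the action with maximum expected utility. With utility = negative cost this recovers ordinary cost-sensitive classification (the matrix values below are illustrative):

```python
def best_action(probs, utility):
    """Pick the action maximising expected utility.
    probs[c]      -- estimated probability of class c
    utility[c][a] -- utility of taking action a when the true class is c
    With utility = -cost this is minimum-expected-cost classification."""
    n_actions = len(utility[0])
    expected = [sum(p * utility[c][a] for c, p in enumerate(probs))
                for a in range(n_actions)]
    return max(range(n_actions), key=expected.__getitem__)
```

Because benefits and costs enter the same matrix, domains with asymmetric payoffs (e.g. a correct alarm is worth far less than a missed one costs) are handled without changing the classifier itself — only the decision rule.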
Conference Paper
Full-text available
Smartphones with Near-Field Communication (NFC) may emulate contactless smart cards, which has resulted in the deployment of various access control, transportation and payment services, such as Google Pay and Apple Pay. Like contactless cards, however, NFC-based smartphone transactions are susceptible to relay attacks, and ambient sensing has been...
Article
Full-text available
Can we evolve better training data for machine learning algorithms? We show that we can enhance the generalisation performance of naive Bayes for regression models by generating artificial surrogate training data using population-based optimisation. These results are important because (i) naive Bayes models are simple and interpretable, but frequen...
Conference Paper
The crowd-sourced Naturewatch GBIF dataset is used to obtain a species classification dataset containing approximately 1.2 million photos of nearly 20 thousand different species of biological organisms observed in their natural habitat. We present a general hierarchical species identification system based on deep convolutional neural networks train...
Conference Paper
Full-text available
Relay attacks are passive man-in-the-middle attacks that aim to extend the physical distance of devices involved in a transaction beyond their operating environment. In the field of smart cards, distance bounding protocols have been proposed in order to counter relay attacks. For smartphones, meanwhile, the natural ambient environment surrounding t...
Conference Paper
Full-text available
Near Field Communication (NFC) has enabled mobile phones to emulate contactless smart cards. Similar to contactless smart cards, they are also susceptible to relay attacks. To counter these, a number of methods have been proposed that rely primarily on ambient sensors as a proximity detection mechanism (also known as an anti-relay mechanism). In th...
Preprint
Full-text available
We present a CUDA based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the GPU and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelized, combining...
Chapter
A powerful way to improve performance in machine learning is to combine the predictions of multiple models. This involves constructing an ensemble of classifiers—e.g., a set of decision trees rather than a single tree. We begin by describing bagging and randomization, which both use a single learning algorithm to generate an ensemble predictor. Bag...
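The bagging procedure described above is short enough to sketch end to end: each ensemble member is trained on a bootstrap sample (drawn with replacement) and predictions are combined by majority vote. A one-feature threshold learner ("decision stump") serves here as an illustrative base model:

```python
import random

def fit_stump(X, y):
    """One-feature threshold classifier on the midpoint of the class
    means; binary labels 0/1 and scalar inputs, for brevity."""
    m0 = sum(x for x, t in zip(X, y) if t == 0) / max(1, y.count(0))
    m1 = sum(x for x, t in zip(X, y) if t == 1) / max(1, y.count(1))
    thr, hi = (m0 + m1) / 2, int(m1 > m0)
    return lambda x: hi if x > thr else 1 - hi

def bagging(X, y, n_models=11, seed=0):
    """Bagging: train each model on a bootstrap sample of the data,
    then predict by majority vote over the ensemble."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(fit_stump([X[i] for i in idx],
                                [y[i] for i in idx]))
    return lambda x: max((0, 1),
                         key=lambda c: sum(m(x) == c for m in models))

pred = bagging([0.0, 1.0, 2.0, 3.0, 8.0, 9.0, 10.0, 11.0],
               [0, 0, 0, 0, 1, 1, 1, 1])
```

The variance-reduction argument in the chapter applies directly: individual stumps fitted to different bootstrap replicates disagree on borderline inputs, but the vote smooths out their instability.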
Chapter
Machine learning requires something to learn from: data. This chapter explains what kind of structure is required in the input data when applying the machine learning techniques covered in the book, and establishes the terminology that will be used. First, we explain what is meant by learning a concept from data, and describe the types of machine l...
Chapter
This book is about machine learning techniques for data mining. We start by explaining what people mean by data mining and machine learning, and give some simple example machine learning problems, including both classification and numeric prediction tasks, to illustrate the kinds of input and output involved. To demonstrate the wide applicability o...
Chapter
We begin by revisiting the basic instance-based learning method of nearest-neighbor classification and considering how it can be made more robust and storage efficient by generalizing both exemplars and distance functions. We then discuss two well-known approaches for generalizing linear models that go beyond modeling linear relationships between t...
Chapter
There are many transformations that can make real-world datasets more amenable to the learning algorithms discussed in the rest of the book. We first consider methods for attribute selection, which remove attributes that are not useful for the task at hand. Then we look at discretization methods: algorithms for turning numeric attributes into dis...
Chapter
Now we plunge into the world of actual machine learning algorithms. This chapter only considers basic, principled, versions of learning algorithms, leaving advanced features that are necessary for real-world deployment for later. A rudimentary rule learning algorithm simply picks a single attribute to make predictions; the well-known “Naive Bayes...
Chapter
In many practical applications, labeled data is rare or costly to obtain. “Semisupervised” learning exploits unlabeled data to improve the performance of supervised learning. We first discuss how to combine clustering with classification; more specifically, how mixture model clustering using expectation maximization can be combined with Naive Bayes...
Chapter
Having examined the input to machine learning, we move on to review the types of output that can be generated. We first discuss decision tables, which are perhaps the most basic form of knowledge representation, before considering linear models such as those produced by linear regression. Next we explain decision trees, the most widely used kind of...
Chapter
One of the most striking contemporary developments in the field of machine learning is the meteoric ascent of what has been called “deep learning,” and this area is now at the forefront of current research. We discuss key differences between traditional neural network architectures and learning techniques, and those that have become popular in deep...
Chapter
This chapter explains practical decision tree and rule learning methods, and also considers more advanced approaches for generating association rules. The basic algorithms for learning classification trees and rules presented in Chapter 4, Algorithms: the basic methods, are extended to make them applicable to real-world problems that contain numeri...
Chapter
Full-text available
Near Field Communication (NFC) has enabled mobile phones to emulate contactless smart cards. Similar to contactless smart cards, they are also susceptible to relay attacks. To counter these, a number of methods have been proposed that rely primarily on ambient sensors as a proximity detection mechanism (also known as an anti-relay mechanism). In th...
Book
Data Mining: Practical Machine Learning Tools and Techniques, Fourth Edition, offers a thorough grounding in machine learning concepts, along with practical advice on applying these tools and techniques in real-world data mining situations. This highly anticipated fourth edition of the most acclaimed work on data mining and machine learning teaches...
Conference Paper
Full-text available
The automatic detection of emotions in Twitter posts is a challenging task due to the informal nature of the language used in this platform. In this paper, we propose a methodology for expanding the NRC word-emotion association lexicon for the language used in Twitter. We perform this expansion using multi-label classification of words and compare...
Conference Paper
Full-text available
Message-level and word-level polarity classification are two popular tasks in Twitter sentiment analysis. They have been commonly addressed by training supervised models from labelled data. The main limitation of these models is the high cost of data annotation. Transferring existing labels from a related problem domain is one possible solution for...
Conference Paper
A system of nested dichotomies is a method of decomposing a multi-class problem into a collection of binary problems. Such a system recursively applies binary splits to divide the set of classes into two subsets, and trains a binary classifier for each split. Although ensembles of nested dichotomies with random structure have been shown to perform...