About
126 Publications
13,666 Reads
3,194 Citations
Publications (126)
Feature selection is the data analysis process that selects a smaller and curated subset of the original dataset by filtering out data (features) which are irrelevant or redundant. The most important features can be ranked and selected based on statistical measures, such as mutual information. Feature selection not only reduces the size of dataset...
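As a concrete illustration of this kind of ranking (a minimal sketch, not the method of any specific paper listed here; the dataset and cutoff k are placeholders), mutual information scores can be estimated per feature and used to order them:

```python
# Hedged sketch: rank features by estimated mutual information with the label.
# The dataset and k are illustrative placeholders.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)   # one MI estimate per feature
ranking = np.argsort(mi)[::-1]                   # highest-MI features first
k = 10
selected = ranking[:k]                           # keep the k most informative features
print("top features:", selected, "MI:", mi[selected])
```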
A typical machine learning development cycle maximizes performance during model training and then minimizes the memory and area footprint of the trained model for deployment on processing cores, graphics processing units, microcontrollers or custom hardware accelerators. However, this becomes increasingly difficult as machine learning models grow l...
This work presents a novel algorithm for transforming a neural network into a spline representation. Unlike previous work that required convex and piecewise-affine network operators to create a max-affine spline alternate form, this work relaxes this constraint. The only constraint is that the function be bounded and possess a well-defined second de...
We present a theory of ensemble diversity, explaining the nature and effect of diversity for a wide range of supervised learning scenarios. This challenge, of understanding ensemble diversity, has been referred to as the holy grail of ensemble learning, an open question for over 30 years. Our framework reveals that diversity is in fact a hidden dim...
End-to-End training (E2E) is becoming more and more popular to train complex Deep Network architectures. An interesting question is whether this trend will continue—are there any clear failure cases for E2E training? We study this question in depth, for the specific case of E2E training an ensemble of networks. Our strategy is to blend the gradient...
There was a mistake in the proof of the optimal shrinkage intensity for our estimator presented in Section 3.1.
Since wearable computing systems have grown in importance in recent years, there is increased interest in implementing machine learning algorithms with reduced-precision parameters and computations. Not only learning but also feature selection, most of the time a mandatory preprocessing step in machine learning, is often constrained by the available...
Feature selection is central to modern data science. The ‘stability’ of a feature selection algorithm refers to the sensitivity of its choices to small changes in training data. This is, in effect, the robustness of the chosen features. This paper considers the estimation of stability when we expect strong pairwise correlations, otherwise known as...
The ultimate goal of a supervised learning algorithm is to produce models constructed on the training data that can generalize well to new examples. In classification, functional margin maximization -- correctly classifying as many training examples as possible with maximal confidence -- has been known to construct models with good generalization gu...
Probability estimates generated by boosting ensembles are poorly calibrated because of the margin maximization nature of the algorithm. The outputs of the ensemble need to be properly calibrated before they can be used as probability estimates. In this work, we demonstrate that online boosting is also prone to producing distorted probability estima...
Information theoretic feature selection methods quantify the importance of each feature by estimating mutual information terms to capture: the relevancy, the redundancy and the complementarity. These terms are commonly estimated by maximum likelihood, while an under-explored area of research is how to use shrinkage methods instead. Our work suggest...
Recent work has integrated semantics into the 3D scene models produced by visual SLAM systems. Though these systems operate close to real time, there has been little study of how to achieve real-time performance by trading off semantic model accuracy against computational requirements. ORB-SLAM2 provides good scene accuracy and real-time proc...
Related code is available at https://github.com/grey-area/modular-loss-experiments
We examine the practice of joint training for neural network ensembles, in which a multi-branch architecture is trained via a single loss. This approach has recently gained traction, with claims of greater accuracy per parameter along with increased parallelism. We introduce a family of novel loss functions generalizing multiple previously proposed a...
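A minimal sketch of the joint-training setup described above, assuming a PyTorch-style multi-branch model; averaging branch logits under one cross-entropy loss is a generic instance of the idea, not the paper's proposed family of loss functions:

```python
# Hedged sketch of joint ensemble training: several branches, one loss on the
# averaged output. Architecture and loss are generic placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEnsemble(nn.Module):
    def __init__(self, in_dim, n_classes, n_branches=3):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))
            for _ in range(n_branches)
        ])

    def forward(self, x):
        # Average the branch logits: the ensemble trains as one model.
        return torch.stack([b(x) for b in self.branches]).mean(dim=0)

model = JointEnsemble(in_dim=20, n_classes=5)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(32, 20)
y = torch.randint(0, 5, (32,))
loss = F.cross_entropy(model(x), y)   # a single loss over the combined output
opt.zero_grad()
loss.backward()
opt.step()
```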
We introduce an approach to modular dimensionality reduction, allowing efficient learning of multiple complementary representations of the same object. Modules are trained by optimising an unsupervised cost function which balances two competing goals: Maintaining the inner product structure within the original space, and encouraging structural dive...
Deep learning systems can be fooled by small, worst-case perturbations of their inputs, known as adversarial examples. This has been almost exclusively studied in supervised learning, on vision tasks. However, adversarial examples in counterfactual modelling, which sits outside the traditional supervised scenario, are an overlooked challenge. We int...
Motivation:
The identification of biomarkers to support decision-making is central to personalized medicine, in both clinical and research scenarios. The challenge can be seen in two halves: identifying predictive markers, which guide the development/use of tailored therapies; and identifying prognostic markers, which guide other aspects of care a...
In an era in which the volume and complexity of datasets are continuously growing, feature selection techniques have become indispensable to extract useful information from huge amounts of data. However, existing algorithms may not scale well when dealing with huge datasets, and a possible solution is to distribute the data across several nodes. In this...
We study the approximate nearest neighbour method for cost-sensitive classification on low-dimensional manifolds embedded within a high-dimensional feature space. We determine the minimax learning rates for distributions on a smooth manifold, in a cost-sensitive setting. This generalises a classic result of Audibert and Tsybakov. Building upon rece...
In this paper we propose and explore the k-Nearest Neighbour UCB algorithm for multi-armed bandits with covariates. We focus on a setting where the covariates are supported on a metric space of low intrinsic dimension, such as a manifold embedded within a high dimensional ambient feature space. The algorithm is conceptually simple and straightforwa...
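A rough sketch of the general idea (not the authors' exact algorithm or constants): each arm keeps its observed (covariate, reward) pairs, and the arm maximizing a k-NN reward estimate plus an exploration bonus is pulled:

```python
# Hedged sketch of a k-NN UCB rule for bandits with covariates. The estimate
# and bonus are illustrative, not the exact form analysed in the paper.
import numpy as np

def knn_ucb_choose(history, x, t, k=5, c=1.0):
    """history[a]: list of (covariate, reward) pairs observed so far for arm a."""
    scores = []
    for a, pairs in enumerate(history):
        if len(pairs) < k:
            return a                                # force initial exploration
        X = np.array([p[0] for p in pairs])
        r = np.array([p[1] for p in pairs])
        d = np.linalg.norm(X - x, axis=1)
        nn = np.argsort(d)[:k]                      # k nearest observed covariates
        bonus = c * np.sqrt(np.log(t + 1) / k)      # optimism under uncertainty
        scores.append(r[nn].mean() + bonus)
    return int(np.argmax(scores))

# Toy usage with 2 arms and 2-D covariates:
rng = np.random.default_rng(0)
history = [[(rng.normal(size=2), rng.random()) for _ in range(8)] for _ in range(2)]
print("pull arm", knn_ucb_choose(history, x=np.zeros(2), t=16))
```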
Ensemble methods are a cornerstone of modern machine learning. The performance of an ensemble depends crucially upon the level of diversity between its constituent learners. This paper establishes a connection between diversity and degrees of freedom (i.e. the capacity of the model), showing that diversity may be viewed as a form of inverse regular...
What is the simplest thing you can do to solve a problem? In the context of semi-supervised feature selection, we tackle exactly this—how much we can gain from two simple classifier-independent strategies. If we have some binary labelled data and some unlabelled, we could assume the unlabelled data are all positives, or assume them all negatives. T...
Ensemble methods are a cornerstone of modern machine learning. The performance of an ensemble depends crucially upon the level of diversity between its constituent learners. This paper establishes a connection between diversity and degrees of freedom (i.e. the capacity of the model), showing that diversity may be viewed as a form of inverse regular...
The energy yield estimation of a photovoltaic (PV) system operating under partially shaded conditions is a challenging task and a very active area of research. In this paper, we attack this problem with the aid of machine learning techniques. Using data simulated by the equivalent circuit of a PV string operating under partial shading, we train and...
Producing stable feature rankings is critical in many areas, such as in bioinformatics where the robustness of a list of ranked genes is crucial to interpretation by a domain expert. In this paper, we study Spearman’s rho as a measure of stability to training data perturbations - not just as a heuristic, but here proving that it is the natural meas...
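For illustration (placeholder data and a toy ranker, not the paper's experiments), stability can be estimated as the average pairwise Spearman's rho between rankings produced on bootstrap resamples:

```python
# Hedged sketch: average pairwise Spearman's rho between feature rankings
# obtained on bootstrap resamples. Data and the ranker are placeholders.
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200) > 0).astype(int)

def rank_features(X, y):
    # Toy ranker: absolute correlation of each feature with the label.
    return np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])

rankings = []
for _ in range(10):
    idx = rng.integers(0, len(y), len(y))          # bootstrap resample
    rankings.append(rank_features(X[idx], y[idx]))

rhos = [spearmanr(a, b)[0] for a, b in combinations(rankings, 2)]
print("estimated stability (mean pairwise rho):", np.mean(rhos))
```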
Under-reporting occurs in survey data when there is a reason for participants to give a false negative response to a question, e.g. maternal smoking in epidemiological studies. Failing to correct this misreporting introduces biases and it may lead to misinformed decision making. Our work provides methods of correcting for this bias, by reinterpreti...
In this paper we describe Jacc, an experimental framework which allows developers to program GPGPUs directly from Java. The goal of Jacc is to allow developers to benefit from using heterogeneous hardware whilst minimizing the amount of code refactoring required. Jacc utilizes two key abstractions: tasks which encapsulate all the information neede...
We study information theoretic methods for ranking biomarkers. In clinical trials, there are two, closely related, types of biomarkers: predictive and prognostic, and disentangling them is a key challenge. Our first step is to phrase biomarker ranking in terms of optimizing an information theoretic quantity. This formalization of the problem will e...
We study information theoretic methods for ranking biomarkers. In clinical trials there are two, closely related, types of biomarkers: predictive and prognostic, and disentangling them is a key challenge. Our first step is to phrase biomarker ranking in terms of optimizing an information theoretic quantity. This formalization of the problem will en...
Under-reporting occurs in survey data when there is a reason to systematically misreport the response to a question. For example, in studies dealing with low birth weight infants, the smoking habits of the mother are very likely to be misreported. This creates problems, such as bias, when calculating effect sizes, but these problems are commonly igno...
In feature selection algorithms, “stability” is the sensitivity of the chosen feature set to variations in the supplied training data. As such it can be seen as an analogous concept to the statistical variance of a predictor. However unlike variance, there is no unique definition of stability, with numerous proposed measures over 15 years of litera...
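To make the analogy with variance concrete, a hedged sketch of a variance-based stability measure over repeated runs; the paper's exact estimator may differ in detail:

```python
# Hedged sketch of a variance-based stability measure over M repeated runs.
# Z[m, f] = 1 if run m selected feature f. Illustrative, not the paper's
# definitive estimator.
import numpy as np

def stability(Z):
    Z = np.asarray(Z, dtype=float)
    M, d = Z.shape
    p = Z.mean(axis=0)                     # selection frequency of each feature
    s2 = M / (M - 1) * p * (1 - p)         # unbiased per-feature selection variance
    k_bar = Z.sum(axis=1).mean()           # average subset size
    denom = (k_bar / d) * (1 - k_bar / d)  # variance under random selection
    return 1 - s2.mean() / denom

Z = np.array([[1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0]])
print(stability(Z))                        # 1.0 would mean perfectly stable
```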
We provide a unifying perspective for two decades of work on cost-sensitive Boosting algorithms. When analyzing the literature 1997–2016, we find 15 distinct cost-sensitive variants of the original algorithm; each of these has its own motivation and claims to superiority—so who should we believe? In this work we critique the Boosting literature usi...
Current parallelizing compilers can tackle applications exercising regular access patterns on arrays or affine indices, where data dependencies can be expressed in a linear form. Unfortunately, there are cases where independence between code statements cannot be guaranteed, and thus the compiler conservatively produces sequential code. Programs th...
We introduce the concept of a Modular Autoencoder (MAE), capable of learning a set of diverse but complementary representations from unlabelled data, that can later be used for supervised tasks. The learning of the representations is controlled by a trade-off parameter, and we show on six benchmark datasets the optimum lies between two extremes: a...
With the growth of high dimensional data, feature selection is a vital component of machine learning as well as an important stand-alone data analytics tool. Without it, the computation cost of big data analytics can become unmanageable and spurious correlations and noise can reduce the accuracy of any results. Feature selection removes irrelevant...
Automated acquisition, or learning, of ontologies has attracted research attention because it can help ontology engineers build ontologies and give domain experts new insights into their data. However, existing approaches to ontology learning are considerably limited, e.g. focus on learning descriptions for given classes, require intense supervisi...
Automated acquisition, or learning, of ontologies has attracted research attention because it can help ontology engineers build ontologies and give domain experts new insights into their data. However, existing approaches to ontology learning are considerably limited, e.g. focus on learning descriptions for given classes, require intense supervisi...
Automated acquisition, or learning, of ontologies has attracted research attention because it can help ontology engineers build ontologies and give domain experts new insights into their data. However, existing approaches to ontology learning are considerably limited, e.g. focus on learning descriptions for given classes, require intense supervisi...
The importance of Markov blanket discovery algorithms is twofold: as the main building block in constraint-based structure learning of Bayesian network algorithms and as a technique to derive the optimal set of features in filter feature selection approaches. Equally, learning from partially labelled data is a crucial and demanding area of machine...
Heterogeneous programming has started becoming the norm in order to achieve better performance by running portions of code on the most appropriate hardware resource. Currently, significant engineering efforts are undertaken in order to enable existing programming languages to perform heterogeneous execution mainly on GPUs. In this paper we describe...
Ensemble methods are often used to decide on a good selection of features for later processing by a classifier. Examples of this are in the determination of Random Forest variable importance proposed by Breiman, and in the concept of feature selection ensembles, where the outputs of multiple feature selectors are combined to yield more robust resul...
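One simple way to combine the outputs of several feature selectors, shown purely to illustrate the ensemble idea (the score vectors below are placeholders, and this is not necessarily the scheme studied here), is aggregation by mean rank:

```python
# Hedged sketch: aggregate rankings from several feature selectors by mean rank
# (a Borda-style combination). The three toy score vectors are placeholders.
import numpy as np

def mean_rank_aggregate(score_lists):
    # Convert each score vector to ranks (0 = best), then average across selectors.
    ranks = [np.argsort(np.argsort(-np.asarray(s))) for s in score_lists]
    return np.mean(ranks, axis=0)

scores_a = [0.9, 0.1, 0.5, 0.3]   # e.g. mutual information
scores_b = [0.8, 0.2, 0.6, 0.1]   # e.g. chi-squared
scores_c = [0.7, 0.3, 0.4, 0.2]   # e.g. random forest importance
agg = mean_rank_aggregate([scores_a, scores_b, scores_c])
print("aggregated order:", np.argsort(agg))   # most robustly ranked first
```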
Asymmetric classification problems are characterized by class imbalance or unequal costs for different types of misclassifications. One of the main cited weaknesses of AdaBoost is its perceived inability to handle asymmetric problems. As a result, a multitude of asymmetric versions of AdaBoost have been proposed, mainly as heuristic modifications t...
We study a dichotomy of scientific styles, unifying and diversifying, as proposed by Freeman J. Dyson. We discuss the extent to which the dichotomy transfers from the natural sciences (where Dyson proposed it) to the field of Pattern Recognition. To address this we must firstly ask what it means to be a “unifier” or “diversifier” in a field, and wh...
We propose a set of novel methodologies which enable valid statistical hypothesis testing when we have only positive and unlabelled (PU) examples. This type of problem, a special case of semi-supervised data, is common in text mining, bioinformatics, and computer vision. Focusing on a generalised likelihood ratio test, we have 3 key contributions:...
In this paper we present a framework to unify information theoretic feature selection criteria for multi-label data. Our framework combines two different ideas; expressing multi-label decomposition methods as composite likelihoods and then showing how feature selection criteria can be derived by maximizing these likelihood expressions. Many existin...
In this paper, we present a novel ensemble method, Random Projection Random Discretization Ensembles (RPRDE), to create ensembles of linear multivariate decision trees by using a univariate decision tree algorithm. The present method combines the better computational complexity of a univariate decision tree algorithm with the better representational p...
As data sets become ever larger it becomes increasingly complex to apply traditional machine learning techniques to them. Feature selection can greatly reduce the computational requirements of machine learning but it too can be memory intensive. In this paper we explore the use of succinct data structures called sketches for probability estimation...
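As an illustration of the kind of structure involved (the Count-Min sketch is one standard example; the parameters below are illustrative), frequencies, and hence probability estimates, can be maintained in sublinear memory:

```python
# Hedged sketch of a Count-Min sketch for frequency/probability estimation.
# Width/depth values are illustrative; estimates overcount with bounded error.
import numpy as np

class CountMinSketch:
    def __init__(self, width=1024, depth=4, seed=0):
        self.w, self.d = width, depth
        self.table = np.zeros((depth, width), dtype=np.int64)
        rng = np.random.default_rng(seed)
        self.salts = rng.integers(1, 2**31, size=depth)  # one hash per row
        self.n = 0

    def _cols(self, item):
        return [hash((int(s), item)) % self.w for s in self.salts]

    def add(self, item):
        self.n += 1
        for row, col in enumerate(self._cols(item)):
            self.table[row, col] += 1

    def estimate_prob(self, item):
        # Minimum over rows limits the damage done by hash collisions.
        count = min(self.table[row, col] for row, col in enumerate(self._cols(item)))
        return count / max(self.n, 1)

cms = CountMinSketch()
for x in ["a", "a", "b", "c", "a"]:
    cms.add(x)
print(cms.estimate_prob("a"))   # ~0.6, up to collision error
```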
Fano's inequality lower bounds the probability of transmission error through a communication channel. Applied to classification problems, it provides a lower bound on the Bayes error rate and motivates the widely used Infomax principle. In modern machine learning, we are often interested in more than just the error rate. In medical diagnosis, diffe...
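For reference, the standard statement of Fano's inequality (with logs in bits, error probability P_e, and label set 𝒴; this is the textbook form, not a result specific to this paper):

```latex
% Fano's inequality and the Bayes error lower bound it implies:
H(P_e) + P_e \log(|\mathcal{Y}|-1) \;\ge\; H(Y \mid X)
\quad\Longrightarrow\quad
P_e \;\ge\; \frac{H(Y \mid X) - 1}{\log |\mathcal{Y}|}
\;=\; \frac{H(Y) - I(X;Y) - 1}{\log |\mathcal{Y}|}.
```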
Thread-Level Speculation (TLS) overcomes limitations intrinsic with conservative compile-time auto-parallelizing tools by extracting parallel threads optimistically and only ensuring absence of data dependence violations at runtime. A significant barrier for adopting TLS (implemented in software) is the overheads associated with maintaining specul...
We present a unifying framework for information theoretic feature selection, bringing almost two decades of research on heuristic filter criteria under a single theoretical interpretation. This is in response to the question: "what are the implicit statistical assumptions of feature selection criteria based on mutual information?". To answer this,...
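The space of criteria unified by this line of work is commonly written as a relevancy term corrected by redundancy and conditional-redundancy terms, with S the already-selected feature set and β, γ the weights that recover individual criteria:

```latex
J(X_k) \;=\; I(X_k; Y)
\;-\; \beta \sum_{X_j \in S} I(X_k; X_j)
\;+\; \gamma \sum_{X_j \in S} I(X_k; X_j \mid Y)
```

Setting β = γ = 0 gives simple mutual information ranking, β = 1/|S| with γ = 0 gives mRMR, and β = γ = 1/|S| gives JMI.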
MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures....
UCS is a Learning Classifier System (LCS) which evolves condition-action rules for supervised classification tasks. In UCS the fitness of a rule is based on its accuracy raised to a power ν, and this fitness is used in both the search for good rules (via a genetic algorithm) and in a classification vote. We trace the origin of the UCS fitness funct...
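For context, the fitness referred to here has the form below (a sketch of the standard UCS definition, with ν the tunable exponent):

```latex
\mathrm{acc} \;=\; \frac{\text{correct classifications}}{\text{experience}},
\qquad
F \;=\; \mathrm{acc}^{\,\nu}
```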
In recent years there have been efforts to develop a probabilistic framework to explain the workings of a Learning Classifier System. This direction of research has met with limited success due to the intractability of complicated heuristic training rules used by the learning classifier systems. In this paper, we derive a learning classifier system...
Thread-Level Speculation (TLS) facilitates the extraction of parallel threads from sequential applications. Most prior work has focused on developing the compiler and architecture for this execution paradigm. Such studies often narrowly concentrated on a specific design point. On the other hand, other studies have attempted to assess how well TLS p...
We introduce Learn++.MF, an ensemble-of-classifiers based algorithm that employs random subspace selection to address the missing feature problem in supervised classification. Unlike most established approaches, Learn++.MF does not replace missing values with estimated ones, and hence does not need specific assumptions on the underlying data distri...
Fundamental nano-patterns are simple, static, binary properties of Java methods, such as ObjectCreator and Recursive. We present a provisional catalogue of 17 such nano-patterns. We report statistical and information theoretic metrics to show the frequency of nano-pattern occurrence in a large corpus of open-source Java projects. We proceed to give...
This paper argues that economic theory can improve our understanding of memory management. We introduce the allocation curve, as an analogue of the demand curve from microeconomics. An allocation curve for a program characterises how the amount of garbage collection activity required during its execution varies in relation to the heap size associat...
Oza's Online Boosting algorithm provides a version of AdaBoost which can be trained in an online way for stationary problems. One perspective is that this enables the power of the boosting framework to be applied to datasets which are too large to fit into memory. The online boosting algorithm assumes the data distribution to be independent and...
Although diversity in classifier ensembles is desirable, its relationship with the ensemble accuracy is not straightforward. Here we derive a decomposition of the majority vote error into three terms: average individual accuracy, “good” diversity and “bad diversity”. The good diversity term is taken out of the individual error whereas the bad diver...
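Schematically, the decomposition described above takes the form below, where \bar{e} is the average individual error and θ(x) is the average disagreement between individual votes and the majority vote on example x (notation here is illustrative; see the paper for the precise statement):

```latex
e_{\mathrm{maj}} \;=\; \bar{e}
\;-\; \underbrace{\frac{1}{N}\sum_{x:\,\text{majority correct}} \theta(x)}_{\text{good diversity}}
\;+\; \underbrace{\frac{1}{N}\sum_{x:\,\text{majority wrong}} \theta(x)}_{\text{bad diversity}}
```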
Thread-level speculation (TLS) is widely accepted to be a realistic model for execution of sequential programs on multi-core architectures. One problem with TLS occurs when large numbers of spawned threads are squashed due to data dependence violations. This can reduce or entirely obliterate the performance benefits of TLS. Until now, informal co...