Machine Learning

Published by Springer Nature
Online ISSN: 1573-0565
Print ISSN: 0885-6125
Recent publications
  • Carlos Ortega Vázquez
  • Seppe vanden Broucke
  • Jochen De Weerdt
Learning from positive and unlabeled data, or PU learning, is the setting in which a binary classifier can only train on positive and unlabeled instances, the latter containing both positive and negative instances. Many PU applications, e.g., fraud detection, are also characterized by class imbalance, which creates a challenging setting. Not only are there fewer minority class examples than in the case where all labels are known; only a small fraction of the unlabeled observations are actually positive. Despite the relevance of the topic, only a few studies have considered a class imbalance setting in PU learning. In this paper, we propose a novel technique that can directly handle imbalanced PU data, named the PU Hellinger Decision Tree (PU-HDT). Our technique exploits the class prior to estimate the counts of positives and negatives in every node of the tree. Moreover, the Hellinger distance is used instead of more conventional splitting criteria because it has been shown to be insensitive to class imbalance. This simple yet effective adaptation allows PU-HDT to perform well on highly imbalanced PU data sets. We also introduce the PU Stratified Hellinger Random Forest (PU-SHRF), which uses PU-HDT as its base learner and integrates stratified bootstrap sampling. Our empirical analysis shows that PU-SHRF substantially outperforms state-of-the-art PU learning methods for imbalanced data sets in most experimental settings.
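As a minimal, hypothetical sketch of a Hellinger-distance splitting criterion like the one described above (the class-prior correction of the counts, which is central to PU-HDT, is omitted; the function name and signature are illustrative):

```python
import math

def hellinger_split_score(pos_left, neg_left, pos_right, neg_right):
    """Hellinger distance between the positive and negative class
    distributions induced by a binary split. Illustrative sketch only:
    PU-HDT additionally corrects these counts with the class prior."""
    pos = pos_left + pos_right  # total positives
    neg = neg_left + neg_right  # total negatives
    if pos == 0 or neg == 0:
        return 0.0
    score = 0.0
    # fraction of positives / negatives falling into each branch
    for p_branch, n_branch in ((pos_left, neg_left), (pos_right, neg_right)):
        score += (math.sqrt(p_branch / pos) - math.sqrt(n_branch / neg)) ** 2
    return math.sqrt(score)
```

A perfectly separating split scores √2, the maximum for a binary split, while a split that leaves the class mix unchanged scores 0; crucially, the score depends only on the branch fractions, not on how imbalanced the two classes are.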
  • Sofie Goethals
  • David Martens
  • Toon Calders
This paper studies how counterfactual explanations can be used to assess the fairness of a model. Using machine learning for high-stakes decisions is a threat to fairness as these models can amplify bias present in the dataset, and there is no consensus on a universal metric to detect this. The appropriate metric and method to tackle the bias in a dataset will be case-dependent, and it requires insight into the nature of the bias first. We aim to provide this insight by integrating explainable AI (XAI) research with the fairness domain. More specifically, apart from being able to use (Predictive) Counterfactual Explanations to detect explicit bias when the model is directly using the sensitive attribute, we show that it can also be used to detect implicit bias when the model does not use the sensitive attribute directly but does use other correlated attributes leading to a substantial disadvantage for a protected group. We call this metric PreCoF, or Predictive Counterfactual Fairness. Our experimental results show that our metric succeeds in detecting occurrences of implicit bias in the model by assessing which attributes are more present in the explanations of the protected group compared to the unprotected group. These results could help policymakers decide on whether this discrimination is justified or not.
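The group-wise comparison of explanation attributes that such a metric performs could be sketched roughly as follows (a simplified illustration assuming explanations arrive as lists of attribute names; the function and its signature are hypothetical, not the authors' API):

```python
from collections import Counter

def precof_style_scores(explanations, groups, protected="protected"):
    """For each attribute, compare how often it appears in counterfactual
    explanations of the protected group vs. the unprotected group.
    A large positive score flags an attribute that disproportionately
    drives explanations for the protected group."""
    prot, unprot = Counter(), Counter()
    n_prot = n_unprot = 0
    for attrs, g in zip(explanations, groups):
        if g == protected:
            prot.update(set(attrs)); n_prot += 1
        else:
            unprot.update(set(attrs)); n_unprot += 1
    all_attrs = set(prot) | set(unprot)
    # difference of per-group appearance frequencies
    return {a: prot[a] / max(n_prot, 1) - unprot[a] / max(n_unprot, 1)
            for a in all_attrs}
```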
  • Jialiang Shen
  • Yu Yao
  • Shaoli Huang
  • [...]
  • Tongliang Liu
Deep models trained on clean data have achieved tremendous success in fine-grained image classification. Yet, they generally suffer significant performance degradation when encountering noisy labels. Existing approaches to handling label noise, though proven effective for generic object recognition, usually fail on fine-grained data. The reason is that, on fine-grained data, the category differences are subtle and the training sample size is small, so deep models can easily overfit the noisy labels. To improve the robustness of deep models on noisy data for fine-grained visual categorization, in this paper we propose a novel learning framework named ProtoSimi. Our method employs an adaptive label correction strategy, ensuring effective learning on limited data. Specifically, our approach explores the effectiveness of both global class-prototype and part class-prototype similarities in identifying and correcting the labels of samples. We evaluate our method on three standard benchmarks of fine-grained recognition. Experimental results show that our method outperforms existing noisy-label methods by a large margin. In ablation studies, we also verify that our method is insensitive to hyper-parameter selection and can be integrated with other FGVC methods to increase generalization performance.
A magic value in a program is a constant symbol that is essential for the execution of the program but has no clear explanation for its choice. Learning programs with magic values is difficult for existing program synthesis approaches. To overcome this limitation, we introduce an inductive logic programming approach to efficiently learn programs with magic values. Our experiments on diverse domains, including program synthesis, drug design, and game playing, show that our approach can (1) outperform existing approaches in terms of predictive accuracies and learning times, (2) learn magic values from infinite domains, such as the value of pi, and (3) scale to domains with millions of constant symbols.
Previous studies have shown that universal adversarial attacks can fool deep neural networks over a large set of input images with a single human-invisible perturbation. However, current methods for universal adversarial attacks are based on additive perturbation, which enables misclassification by directly adding the perturbation to the input images. In this paper, for the first time, we show that a universal adversarial attack can also be achieved through spatial transformation (non-additive). More importantly, to unify both additive and non-additive perturbations, we propose a novel unified yet flexible framework for universal adversarial attacks, called GUAP, which can initiate attacks by an ℓ∞-norm (additive) perturbation, a spatially-transformed (non-additive) perturbation, or a combination of both. Extensive experiments are conducted on two computer vision scenarios, image classification and semantic segmentation, covering the CIFAR-10, ImageNet and Cityscapes datasets with a number of different deep neural network models, including GoogLeNet, VGG16/19, ResNet101/152, DenseNet121, and FCN-8s. Empirical experiments demonstrate that GUAP obtains higher attack success rates on these datasets than state-of-the-art universal adversarial attacks. In addition, we demonstrate how universal adversarial training benefits the robustness of the model against universal attacks. We release our tool GUAP on
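The ℓ∞ projection and the combination of additive with non-additive perturbations can be illustrated with a toy sketch (here an integer pixel shift stands in for the learned spatial transformation, which in GUAP is a continuous flow field; all names are illustrative):

```python
import numpy as np

def clip_linf(delta, eps):
    """Project an additive perturbation onto the l-infinity ball of radius eps."""
    return np.clip(delta, -eps, eps)

def guap_style_perturb(x, delta, shift, eps):
    """Toy combination of a non-additive perturbation (an integer pixel
    shift, standing in for a spatial transform) with an l-inf-bounded
    additive perturbation, keeping the result a valid image in [0, 1]."""
    x_spatial = np.roll(x, shift, axis=(0, 1))        # non-additive part
    return np.clip(x_spatial + clip_linf(delta, eps), 0.0, 1.0)
```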
Deep learning methods are recognised as state-of-the-art for many applications of machine learning. Recently, deep learning methods have emerged as a solution to the task of automatic music generation (AMG) using symbolic tokens in a target style, but their superiority over non-deep learning methods has not been demonstrated. Here, we conduct a listening study to comparatively evaluate several music generation systems along six musical dimensions: stylistic success, aesthetic pleasure, repetition or self-reference, melody, harmony, and rhythm. A range of models, both deep learning algorithms and other methods, are used to generate 30-s excerpts in the style of Classical string quartets and classical piano improvisations. Fifty participants with relatively high musical knowledge rate unlabelled samples of computer-generated and human-composed excerpts for the six musical dimensions. We use non-parametric Bayesian hypothesis testing to interpret the results, allowing the possibility of finding meaningful non-differences between systems’ performance. We find that the strongest deep learning method, a reimplemented version of Music Transformer, has equivalent performance to a non-deep learning method, MAIA Markov, demonstrating that to date, deep learning does not outperform other methods for AMG. We also find there still remains a significant gap between any algorithmic method and human-composed excerpts.
Deep reinforcement learning agents are vulnerable to adversarial attacks. In particular, recent studies have shown that attacking a few key steps can effectively decrease the agent’s cumulative reward. However, all existing attacking methods define those key steps with human-designed heuristics, and it is not clear how more effective key steps can be identified. This paper introduces a novel reinforcement learning framework that learns key steps through interacting with the agent. The proposed framework does not require any human heuristics nor knowledge, and can be flexibly coupled with any white-box or black-box adversarial attack scenarios. Experiments on benchmark Atari games across different scenarios demonstrate that the proposed framework is superior to existing methods for identifying effective key steps. The results highlight the weakness of RL agents even under budgeted attacks.
Autonomous robots are increasingly being integrated into human environments, where explicit and implicit social norms guide the behavior of all agents. To ensure safety and predictability, these artificial agents should act in accordance with the applicable social norms. However, it is not straightforward to define these rules and incorporate them in an agent’s policy, particularly because social norms are often implicit and environment-specific. In this paper, we propose a novel iterative approach to extract a set of rules from observed human trajectories. This hybrid method combines the strengths of inverse reinforcement learning and inductive logic programming. We experimentally show how our method successfully induces a compact logic program which represents the behavioral constraints applicable in a Tower of Hanoi and a traffic simulator environment. The induced program is adopted as prior knowledge by a model-free reinforcement learning agent to speed up training and prevent any social norm violation during exploration and deployment. Moreover, expressing norms as a logic program provides improved interpretability, which is an important pillar in the design of safe artificial agents, as well as transferability to similar environments.
In this work we show that deep learning classifiers tend to become overconfident in their answers under adversarial attacks, even when the classifier is optimized to survive such attacks. Our work draws upon stochastic geometry and graph algorithms to propose a general framework to replace the last fully connected layer and softmax output. This framework (a) can be applied to any classifier and (b) significantly reduces the classifier’s overconfidence in its output without much of an impact on its accuracy when compared to original adversarially-trained classifiers. Its relative effectiveness increases as the attacker becomes more powerful. Our use of graph algorithms in adversarial learning is new and of independent interest. Finally, we show the advantages of this last-layer softmax replacement over image tasks under common adversarial attacks.
Above: Saliency of different convolutional filters over 150 epochs of SGD for a convolutional neural network trained on CIFAR-10 dataset. Below: Function samples drawn from GP with the exponential kernel. Both processes follow an exponentially decaying and asymptotic behavior
Small scale model neural network architecture for CIFAR-10. Parentheticals indicate the number of convolutional filters, or neurons in a layer. The receptive field size for convolution is (3, 3), Max Pooling is done with receptive field size (2, 2)
Visualization of qualitative differences between GP and MOGP prediction. Top: GP. Bottom: 18-MOGP. The dataset is separated into a training (green) set of observations, and future saliency forms the validation (blue) set. The posterior belief of the saliency is visualized as the predictive mean (red line) and a 95% confidence interval (error bar). In many cases (e.g., top left graph), GP is unable to predict the long-term trends of the data due to irregular element saliency observations. However, MOGP is able to overcome data irregularities by utilizing correlations between the saliency of the network elements
Comparing dynamic penalty scaling versus static penalty scaling on pruning tasks for CIFAR-100 CNN training. Dynamic penalty scaling encourages gradual pruning across a wide variety of settings of λ
Correlation between saliency of network elements at initialization, and saliency of network elements after training. Top: CIFAR-10, Bottom: CIFAR-100. Left-to-right: Layers 1 through 5 of the convolutional neural network
Deep neural networks (DNNs) are costly to train. Pruning, an approach that alleviates model complexity by zeroing out DNN elements with little to no efficacy at a given task, has shown promise in reducing training costs. This paper presents a novel method to perform early pruning of DNN elements (e.g., neurons or convolutional filters) during the training process while minimizing losses to model performance. To achieve this, we model the efficacy of DNN elements in a Bayesian manner, conditioned upon efficacy data collected during training, and prune DNN elements with low predictive efficacy after training completion. Empirical evaluations show that the proposed Bayesian early pruning improves the computational efficiency of DNN training while better preserving model performance compared to other tested pruning approaches.
Counterfactuals are central in causal human reasoning and the scientific discovery process. The uplift, also called conditional average treatment effect, measures the causal effect of some action, or treatment, on the outcome of an individual. This paper discusses how it is possible to derive bounds on the probability of counterfactual statements based on uplift terms. First, we derive some original bounds on the probability of counterfactuals and we show that tightness of such bounds depends on the information of the feature set on the uplift term. Then, we propose a point estimator based on the assumption of conditional independence between the counterfactual outcomes. The quality of the bounds and the point estimators are assessed on synthetic data and a large real-world customer data set provided by a telecom company, showing significant improvement over the state of the art.
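For intuition, the textbook Fréchet–Hoeffding bounds on a counterfactual probability given only the two marginal outcome probabilities look as follows (the paper derives tighter, feature-conditional bounds; this sketch is only the classical baseline, and the function name is illustrative):

```python
def frechet_bounds(p1, p0):
    """Frechet-Hoeffding bounds on P(Y(1)=1, Y(0)=0) -- the probability
    that the treatment causes the outcome -- given only the marginals
    p1 = P(Y(1)=1) and p0 = P(Y(0)=1). The uplift is p1 - p0, which is
    exactly the lower bound whenever it is positive."""
    lower = max(0.0, p1 - p0)
    upper = min(p1, 1.0 - p0)
    return lower, upper
```

The gap between the two bounds is what extra information about the features can shrink, which is the sense in which the tightness of such bounds depends on how informative the feature set is about the uplift.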
Recent work has shown that learning systems can use logical background knowledge to compensate for a lack of labeled training data. Many methods work by creating a loss function that encodes this knowledge. However, the logic is often discarded after training, even if it is still helpful at test time. Instead, we ensure neural network predictions satisfy the knowledge by refining the predictions with an extra computation step. We introduce differentiable refinement functions that find a corrected prediction close to the original prediction. We study how to effectively and efficiently compute these refinement functions. Using a new algorithm called iterative local refinement (ILR), we combine refinement functions to find refined predictions for logical formulas of any complexity. ILR finds refinements on complex SAT formulas in significantly fewer iterations and frequently finds solutions where gradient descent cannot. Finally, ILR produces competitive results on the MNIST addition task.
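A single refinement step of the kind described, for one Łukasiewicz implication a → b, might look like this (a deliberately simplified stand-in for ILR, which handles formulas of arbitrary complexity; the closed-form correction here is an assumption for illustration):

```python
def refine_implication(a, b, t=1.0):
    """Minimally move two predicted truth values (a, b) so that the
    Lukasiewicz implication min(1, 1 - a + b) >= t holds. When the
    constraint is violated, the deficit is split equally between
    lowering a and raising b, then clipped back to [0, 1]."""
    deficit = (a - b) - (1.0 - t)
    if deficit <= 0:                     # constraint already satisfied
        return a, b
    a2 = max(0.0, a - deficit / 2)
    b2 = min(1.0, b + deficit / 2)
    return a2, b2
```

The point mirrored from the abstract: the corrected prediction stays as close as possible to the original one while now satisfying the knowledge.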
Deep neural learning has shown remarkable performance at learning representations for visual object categorization. However, deep neural networks such as CNNs do not explicitly encode objects and relations among them. This limits their success on tasks that require a deep logical understanding of visual scenes, such as Kandinsky patterns and Bongard problems. To overcome these limitations, we introduce αILP, a novel differentiable inductive logic programming framework that learns to represent scenes as logic programs—intuitively, logical atoms correspond to objects, attributes, and relations, and clauses encode high-level scene information. αILP has an end-to-end reasoning architecture from visual inputs. Using it, αILP performs differentiable inductive logic programming on complex visual scenes, i.e., the logical rules are learned by gradient descent. 
Our extensive experiments on Kandinsky patterns and CLEVR-Hans benchmarks demonstrate the accuracy and efficiency of αILP in learning complex visual-logical concepts.
Visualisation of an embedded matrix created using two time series: TS1 and TS2. Each row contains a window of a time series, composed of a set of lagged values (L1 to L5) and the corresponding next value/true output (y). The windows corresponding to the training and test sets are highlighted in yellow and blue, respectively
Overview of an example SETAR-Tree constructed using TS1 and TS2. The matrices containing the past time series lags and their corresponding true outputs are considered as nodes. Each row of a matrix is considered a training instance. The training instances of each node are optimally split into child nodes based on the concept of SETAR models. Here, l and T at the split nodes refer to the optimal lag and the optimal threshold used to split at a particular node. A node is only split if there is remaining non-linearity in the training instances of that node and/or a certain training error reduction can be gained by splitting. This example tree model has three levels. The model uses a leaf-wise tree growth approach; thus, the branches of the tree may have different lengths
Threshold Autoregressive (TAR) models have been widely used by statisticians for non-linear time series forecasting during the past few decades, due to their simplicity and mathematical properties. In the forecasting community, on the other hand, general-purpose tree-based regression algorithms (forests, gradient boosting) have recently become popular due to their ease of use and accuracy. In this paper, we explore the close connections between TAR models and regression trees. These enable us to use the rich methodology from the literature on TAR models to define a hierarchical TAR model as a regression tree that trains globally across series, which we call SETAR-Tree. In contrast to general-purpose tree-based models, which do not primarily focus on forecasting and calculate averages at the leaf nodes, we introduce a new forecasting-specific tree algorithm that trains global Pooled Regression (PR) models in the leaves, allowing the models to learn cross-series information, and that uses time-series-specific splitting and stopping procedures. The depth of the tree is controlled by conducting a statistical linearity test commonly employed in TAR models, as well as by measuring the error reduction percentage at each node split. Thus, the proposed tree model requires minimal external hyperparameter tuning and provides competitive results under its default configuration. We also use this tree algorithm to develop a forest in which the forecasts of a collection of diverse SETAR-Trees are combined during the forecasting process. In our evaluation on eight publicly available datasets, the proposed tree and forest models achieve significantly higher accuracy than a set of state-of-the-art tree-based algorithms and forecasting benchmarks across four evaluation metrics.
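A bare-bones version of the threshold split at the heart of such a tree could look as follows (illustrative only: the real SETAR-Tree fits global pooled regressions across series and runs a statistical linearity test before accepting a split; the function name and candidate-threshold grid are assumptions):

```python
import numpy as np

def setar_split(X, y, lag, thresholds):
    """Pick the threshold on a given lag column that minimises the summed
    squared error of two separate linear models fit on either side of it,
    i.e., a two-regime SETAR-style split of the embedded matrix."""
    best = (np.inf, None)
    for T in thresholds:
        mask = X[:, lag] <= T
        if mask.sum() < 2 or (~mask).sum() < 2:
            continue                     # degenerate split, skip
        sse = 0.0
        for side in (mask, ~mask):
            # linear model with intercept on this regime's instances
            A = np.c_[X[side], np.ones(side.sum())]
            coef, *_ = np.linalg.lstsq(A, y[side], rcond=None)
            sse += float(np.sum((A @ coef - y[side]) ** 2))
        if sse < best[0]:
            best = (sse, T)
    return best                          # (sse, threshold)
```

On data generated from two linear regimes, the search recovers the regime boundary as the threshold with near-zero error.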
Performance of trained neural network (NN) models, in terms of testing accuracy, has improved remarkably over the past several years, especially with the advent of deep learning. However, even the most accurate NNs can be biased toward a specific output classification due to the inherent bias in the available training datasets, which may propagate to the real-world implementations. This paper deals with the robustness bias, i.e., the bias exhibited by the trained NN by having a significantly large robustness to noise for a certain output class, as compared to the remaining output classes. The bias is shown to result from imbalanced datasets, i.e., the datasets where all output classes are not equally represented. Towards this, we propose the UnbiasedNets framework, which leverages K-means clustering and the NN’s noise tolerance to diversify the given training dataset, even from relatively smaller datasets. This generates balanced datasets and reduces the bias within the datasets themselves. To the best of our knowledge, this is the first framework catering to the robustness bias problem in NNs. We use real-world datasets to demonstrate the efficacy of the UnbiasedNets for data diversification, in case of both binary and multi-label classifiers. The results are compared to well-known tools aimed at generating balanced datasets, and illustrate how existing works have limited success while addressing the robustness bias. In contrast, UnbiasedNets provides a notable improvement over existing works, while even reducing the robustness bias significantly in some cases, as observed by comparing the NNs trained on the diversified and original datasets.
This paper investigates the validity of Kleinberg's axioms for clustering functions with respect to the popular k-means clustering algorithm. While Kleinberg's axioms have been discussed heavily in the past, we concentrate here on the case most relevant for the k-means algorithm, that is, behavior embedded in Euclidean space. We point out some contradictions and counterintuitive aspects of this axiomatic set within ℝ^m that have evidently not been discussed so far. Our results suggest that, without clearly defining what kind of clusters we expect, we will not be able to construct a valid axiomatic system. In particular, we look at the shape of and the gaps between the clusters. Finally, we demonstrate that there exist several ways to reconcile the formulation of the axioms with their intended meaning, and that under this reformulation the axioms cease to be contradictory and the real-world k-means algorithm conforms to this axiomatic system.
The early detection of anomalous events in time series data is essential in many domains of application. In this paper we deal with critical health events, which represent a significant cause of mortality in intensive care units of hospitals. The timely prediction of these events is crucial for mitigating their consequences and improving healthcare. One of the most common approaches to tackling early anomaly detection problems is through standard classification methods. In this paper we propose a novel method that uses a layered learning architecture to address these tasks. One key contribution of our work is the idea of pre-conditional events, which denote arbitrary but computable relaxed versions of the event of interest. We leverage this idea to break the original problem into two hierarchical layers, which we hypothesize are easier to solve. The results suggest that the proposed approach leads to better performance relative to state-of-the-art approaches for critical health episode prediction.
We propose a decentralised “local2global” approach to graph representation learning that can be used a priori to scale any embedding technique. Our local2global approach proceeds by first dividing the input graph into overlapping subgraphs (or “patches”) and training local representations for each patch independently. In a second step, we combine the local representations into a globally consistent representation by estimating the set of rigid motions that best align the local representations, using information from the patch overlaps, via group synchronization. A key distinguishing feature of local2global relative to existing work is that patches are trained independently, without the often costly parameter synchronization required during distributed training. This allows local2global to scale to large-scale industrial applications, where the input graph may not even fit into memory and may be stored in a distributed manner. We apply local2global on data sets of different sizes and show that our approach achieves a good trade-off between scale and accuracy on edge reconstruction and semi-supervised classification. We also consider the downstream task of anomaly detection and show how one can use local2global to highlight anomalies in cybersecurity networks.
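The per-overlap rigid alignment step can be sketched with orthogonal Procrustes (a minimal reconstruction under the assumption of rotation-plus-translation alignment between two patches' embeddings of their shared nodes; the group synchronization across all patches is omitted, and the function name is illustrative):

```python
import numpy as np

def align_patch(ref, patch):
    """Estimate the rigid motion (rotation R, translation t) that best maps
    `patch` embeddings of the overlap nodes onto `ref` embeddings, via
    orthogonal Procrustes on the mean-centred point sets."""
    mu_r, mu_p = ref.mean(0), patch.mean(0)
    # SVD of the cross-covariance gives the optimal orthogonal map
    U, _, Vt = np.linalg.svd((ref - mu_r).T @ (patch - mu_p))
    R = U @ Vt
    t = mu_r - mu_p @ R.T
    return R, t            # aligned embeddings: patch @ R.T + t
```

Applying the recovered motion to the patch reproduces the reference coordinates on the overlap, which is exactly the consistency signal group synchronization then propagates globally.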
Existing explanation algorithms have found that, even if deep models make the same correct predictions on the same image, they might rely on different sets of input features for classification. However, among these features, some common features might be used by the majority of models. In this paper, we ask what common features various models use for classification and whether models with better performance favor those common features. For this purpose, our work uses an explanation algorithm to attribute the importance of features (e.g., pixels or superpixels) as explanations and proposes the cross-model consensus of explanations to capture the common features. Specifically, we first prepare a set of deep models as a committee, then deduce the explanation for every model, and obtain the consensus of explanations across the entire committee through voting. With the cross-model consensus of explanations, we conduct extensive experiments using 80+ models on five datasets/tasks. We find three interesting phenomena: (1) the consensus obtained from image classification models is aligned with the ground truth of semantic segmentation; (2) we measure the similarity of the explanation result of each model in the committee to the consensus (namely the consensus score), and find positive correlations between the consensus score and model performance; and (3) the consensus score potentially correlates with interpretability.
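The consensus-and-scoring idea might be sketched as follows (simplified: averaging attribution maps and using cosine similarity as the consensus score, whereas the paper aggregates by voting; names are illustrative):

```python
import numpy as np

def consensus_and_scores(attributions):
    """Average per-model attribution maps into a consensus explanation and
    score each committee member by its cosine similarity to that consensus.
    `attributions` is an (n_models, n_features) array-like of importances."""
    A = np.asarray(attributions, dtype=float)
    consensus = A.mean(axis=0)

    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    scores = [cos(a, consensus) for a in A]   # one consensus score per model
    return consensus, scores
```

Models whose explanations align with the committee-wide consensus receive higher scores, mirroring the positive correlation with model performance reported above.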
We study linear ill-posed inverse problems with noisy data in the framework of statistical learning. The corresponding linear operator equation is assumed to fit a given Hilbert scale, generated by some unbounded self-adjoint operator. Approximate reconstructions from random noisy data are obtained with general regularization schemes in such a way that these belong to the domain of the generator. The analysis thus has to distinguish two cases: the regular one, when the true solution also belongs to the domain of the generator, and the ‘oversmoothing’ one, when this is not the case. Rates of convergence for the regularized solutions are expressed in terms of certain distance functions. For solutions with smoothness given in terms of source conditions with respect to the scale-generating operator, the error bounds can then be made explicit in terms of the sample size.
Unsupervised discretization is a crucial step in many knowledge discovery tasks. The state-of-the-art method for one-dimensional data infers locally adaptive histograms using the minimum description length (MDL) principle, but the multi-dimensional case is far less studied: current methods consider the dimensions one at a time (if not independently), which results in discretizations based on rectangular cells of adaptive size. Unfortunately, this approach is unable to adequately characterize dependencies among dimensions and/or results in discretizations consisting of more cells (or bins) than is desirable. To address this problem, we propose an expressive model class that allows for far more flexible partitions of two-dimensional data. We extend the state of the art for the one-dimensional case to obtain a model selection problem based on the normalized maximum likelihood, a form of refined MDL. As the flexibility of our model class comes at the cost of a vast search space, we introduce a heuristic algorithm, named PALM, which partitions each dimension alternately and then merges neighboring regions, all using the MDL principle. Experiments on synthetic data show that PALM (1) accurately reveals ground truth partitions that are within the model class (i.e., the search space), given a large enough sample size; (2) approximates well a wide range of partitions outside the model class; and (3) converges, in contrast to the state-of-the-art multivariate discretization method IPD. Finally, we apply our algorithm to three spatial datasets, and we demonstrate that, compared to kernel density estimation (KDE), our algorithm not only reveals more detailed density changes, but also fits unseen data better, as measured by the log-likelihood.
Label collection is costly in many applications, which poses the need for label-efficient learning. In this work, we present Diverse and Consistent Multi-view Networks (DiCoM)—a novel semi-supervised regression technique based on a multi-view learning framework. DiCoM combines diversity with consistency—two seemingly opposing yet complementary principles of multi-view learning—based on underlying probabilistic graphical assumptions. Given multiple deep views of the same input, DiCoM encourages a negative correlation among the views’ predictions on labeled data, while simultaneously enforces their agreement on unlabeled data. DiCoM can utilize either multi-network or multi-branch architectures to make a trade-off between computational cost and modeling performance. Under realistic evaluation setups, DiCoM outperforms competing methods on tabular, time series and image data. Our ablation studies confirm the importance of having both consistency and diversity.
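A rough sketch of a loss combining the two principles described above (the weights and exact penalty forms here are illustrative assumptions, not the authors' objective; the function name is hypothetical):

```python
import numpy as np

def dicom_style_loss(preds_labeled, y, preds_unlabeled,
                     lam_div=0.1, lam_con=0.1):
    """Multi-view regression loss combining (i) fit on labeled data,
    (ii) a diversity penalty on correlated labeled-data errors across
    views, and (iii) a consistency penalty pulling the views' unlabeled
    predictions together."""
    P = np.asarray(preds_labeled, float)    # (n_views, n_labeled)
    U = np.asarray(preds_unlabeled, float)  # (n_views, n_unlabeled)
    fit = float(np.mean((P - y) ** 2))
    errs = P - y
    ens = errs.mean(axis=0)
    # penalize correlated errors across views (encourages diversity)
    diversity = float(np.mean(errs * ens[None, :]))
    # penalize disagreement among views on unlabeled inputs (consistency)
    consistency = float(np.mean((U - U.mean(axis=0)) ** 2))
    return fit + lam_div * diversity + lam_con * consistency
```

With perfect, identical predictions everywhere the loss is zero; any labeled error, correlated residuals, or unlabeled disagreement increases it.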
Plot shows the critical value of σ for which the exponential term of (64) is dominated by the second term, thereby allowing Proposition 1 to hold. In particular, any σ > 0 chosen between zero and the curves shown above satisfies the proposition. We show plots for varying values of λ_TD, which is determined by the feature space representation. For each value of λ_TD, we vary the ratio of the constants K_2/K_1 from 0.001 to 100
Navigation Problem: (a) Average reward per episode with confidence bounds over 50 trials. (b) Average gradient norm proxy over 50 trials. A-GTD converges fastest with respect to the cumulative reward and gradient norm proxy at the cost of converging to a suboptimal stationary point (see Fig. 3). A moving average filter of size ten has been applied on the gradient norm proxy to aid in comparison
Visualization of the learned policy for the navigation problem. The obstacle is shown in the top right corner, and the target is located at (-2,-2). As Fig. 2 (a) depicts, TD (shown in (a)) and GTD (shown in (b)) learn meaningful policies which guide the agent to the target. In contrast, A-GTD (shown in (c)) simply learns to avoid the obstacle
Pendulum Problem: (a) Average reward per episode with confidence bounds over 50 trials. (b) Average gradient norm proxy over 50 trials. In contrast to the navigation problem there is a significant gain in using advantage actor-critic; here, the state action (Q) function was used instead of the value function (V). A moving average filter of size ten has been applied on the gradient norm proxy to aid in comparison
Reinforcement learning, mathematically described by Markov decision processes, may be approached either through dynamic programming or policy search. Actor-critic algorithms combine the merits of both approaches by alternating between steps to estimate the value function and policy gradient updates. Because the updates exhibit correlated noise and biased gradients, only the asymptotic behavior of actor-critic is known, by connecting its behavior to dynamical systems. This work puts forth a new variant of actor-critic that employs Monte Carlo rollouts during the policy search updates, which results in controllable bias that depends on the number of critic evaluations. As a result, we are able to provide for the first time the convergence rate of actor-critic algorithms when the policy search step employs policy gradient, agnostic to the choice of policy evaluation technique. In particular, we establish conditions under which the sample complexity is comparable to that of stochastic gradient methods for non-convex problems, or slower as a result of the critic estimation error, which is the main complexity bottleneck. These results hold in continuous state and action spaces with linear function approximation for the value function. We then specialize these conceptual results to the case where the critic is estimated by Temporal Difference, Gradient Temporal Difference, and Accelerated Gradient Temporal Difference. These learning rates are then corroborated on a navigation problem involving an obstacle and on the pendulum problem, which provide insight into the interplay between optimization and generalization in reinforcement learning.
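The controllable-bias mechanism can be illustrated with a minimal Monte Carlo critic: Q(s, a) is estimated by averaging discounted returns over a number of rollouts, and that rollout count is the knob controlling the critic's error. The `env_step`/`policy` callable interface below is our assumption for the sketch, not the paper's API:

```python
import numpy as np

def mc_q_estimate(env_step, state, action, policy,
                  n_rollouts=10, horizon=20, gamma=0.95, rng=None):
    """Monte Carlo critic sketch: estimate Q(state, action) by averaging the
    discounted return of n_rollouts finite-horizon trajectories. More
    rollouts reduce the critic's estimation error, the bias term that
    drives the convergence-rate analysis described in the abstract."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_rollouts):
        s, a, ret, disc = state, action, 0.0, 1.0
        for _ in range(horizon):
            s, r = env_step(s, a, rng)   # user-supplied transition + reward
            ret += disc * r
            disc *= gamma
            a = policy(s, rng)           # user-supplied stochastic policy
        total += ret
    return total / n_rollouts
```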
Multi-output Gaussian processes (MOGPs) can help to improve predictive performance for some output variables by leveraging the correlation with other output variables. In this paper, our main motivation is to use multi-output Gaussian processes to exploit correlations between outputs where each output is a multi-class classification problem. MOGPs have mostly been used for multi-output regression. There are some existing works that use MOGPs for other types of outputs, e.g., multi-output binary classification. However, MOGPs for multi-class classification have been less studied. The reason is twofold: (1) when using a softmax function, it is not clear how to scale it beyond the case of a few outputs; (2) the most common type of data in multi-class classification problems is image data, and MOGPs are not specifically designed for image data. We thus propose a new MOGP model called Multi-output Gaussian Processes with Augment & Reduce (MOGPs-AR) that can deal with large-scale classification and downsized image input data. Large-scale classification is achieved by subsampling both training data sets and classes in each output, whereas downsized image input data is handled by incorporating a convolutional kernel into the new model. We show empirically that our proposed model outperforms single-output Gaussian processes in terms of different performance metrics and multi-output Gaussian processes in terms of scalability, both in synthetic and in real classification problems. We include an example with the Omniglot dataset where we showcase the properties of our model.
Approximate learning machines have become popular in the era of small devices, including quantised, factorised, hashed, or otherwise compressed predictors, and the quest to explain and guarantee good generalisation abilities for such methods has just begun. In this paper, we study the role of approximability in learning, both in the full precision and the approximated settings. We do this through a notion of sensitivity of predictors to the action of the approximation operator at hand. We prove upper bounds on the generalisation of such predictors, yielding the following main findings, for any PAC-learnable class and any given approximation operator: (1) We show that under mild conditions, approximable target concepts are learnable from a smaller labelled sample, provided sufficient unlabelled data; (2) We give algorithms that guarantee a good predictor whose approximation also enjoys the same generalisation guarantees; (3) We highlight natural examples of structure in the class of sensitivities, which reduce, and possibly even eliminate, the otherwise abundant requirement of additional unlabelled data, and hence shed new light onto what makes one problem instance easier to learn than another. These results embed the scope of modern model-compression approaches into the general goal of statistical learning theory, which in turn suggests appropriate algorithms through minimising uniform bounds.
Being able to provide counterfactual interventions—sequences of actions we would have had to take for a desirable outcome to happen—is essential to explain how to change an unfavourable decision by a black-box machine learning model (e.g., being denied a loan request). Existing solutions have mainly focused on generating feasible interventions without providing explanations of their rationale. Moreover, they need to solve a separate optimization problem for each user. In this paper, we take a different approach and learn a program that outputs a sequence of explainable counterfactual actions given a user description and a causal graph. We leverage program synthesis techniques, reinforcement learning coupled with Monte Carlo Tree Search for efficient exploration, and rule learning to extract explanations for each recommended action. An experimental evaluation on synthetic and real-world datasets shows how our approach, FARE (eFficient counterfActual REcourse), generates effective interventions by making orders of magnitude fewer queries to the black-box classifier compared to existing solutions, with the additional benefit of complementing them with interpretable explanations.
Empirical risk (MSE) difference evaluated on test data as a function of the training iterations. Left: baseline lasso model. Right: the proposed MRD method. The red (lasso) and blue (proposed method) curves represent the 25th, 50th, and 75th percentiles of {RD̂_j : j ∈ H_1}. The green (lasso) and orange (proposed method) color-shaded curves represent the 75th percentile of {RD̂_j : j ∈ H_0}. Each point is evaluated by averaging RD̂_j over 100 independent data sets
Synthetic experiments on a sparse interaction model (M4) with varying signal amplitude. Power and FDR (q = 0.2) are evaluated on 50 independent trials
Semi-synthetic experiments with real HIV mutation features and a simulated non-linear response using the generating function M2. Empirical power and FDR with q = 0.2 are evaluated over 20 random train/test splits
The average number of drug-resistance mutations in the HIV data discovered by the base/MRD models
The model-X conditional randomization test is a generic framework for conditional independence testing, unlocking new possibilities to discover features that are conditionally associated with a response of interest while controlling type I error rates. An appealing advantage of this test is that it can work with any machine learning model to design a powerful test statistic. In turn, the common practice in the model-X literature is to form a test statistic using machine learning models, trained to maximize predictive accuracy in the hope of attaining a test with good power. However, the ideal goal here is to drive the model (during training) to maximize the power of the test, not merely the predictive accuracy. In this paper, we bridge this gap by introducing novel model-fitting schemes that are designed to explicitly improve the power of model-X tests. This is done by introducing a new cost function that aims at maximizing the test statistic used to measure violations of conditional independence. Using synthetic and real data sets, we demonstrate that the combination of our proposed loss function with various base predictive models (lasso, elastic net, and deep neural networks) consistently increases the number of correct discoveries obtained, while maintaining type I error rates under control.
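For context, the generic conditional randomization test that these fitting schemes plug into can be sketched in a few lines; `stat_fn` and `sample_col_j` are user-supplied callables in this sketch, and the power-tuned training itself is the paper's contribution, not shown here:

```python
import numpy as np

def crt_pvalue(stat_fn, X, y, j, sample_col_j, n_perm=200, seed=0):
    """Model-X conditional randomization test for feature j: compare the
    observed test statistic against statistics recomputed after resampling
    column j from its (assumed known) conditional distribution given the
    other features. Returns a finite-sample valid p-value."""
    rng = np.random.default_rng(seed)
    t_obs = stat_fn(X, y)
    exceed = 0
    for _ in range(n_perm):
        Xt = X.copy()
        Xt[:, j] = sample_col_j(rng)      # draw X_j | X_-j under the model
        exceed += stat_fn(Xt, y) >= t_obs
    return (1 + exceed) / (1 + n_perm)
```

Any fitted model enters only through `stat_fn`, which is exactly why training the model to maximize this statistic, rather than predictive accuracy, can raise the test's power.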
Gaussian processes (GPs) are an important tool in machine learning and statistics. However, off-the-shelf GP inference procedures are limited to datasets with several thousand data points because of their cubic computational complexity. For this reason, many sparse GP techniques have been developed over the past years. In this paper, we focus on GP regression tasks and propose a new approach based on aggregating predictions from several local and correlated experts. Thereby, the degree of correlation between the experts can vary from independent to fully correlated experts. The individual predictions of the experts are aggregated taking into account their correlation, resulting in consistent uncertainty estimates. Our method recovers independent Product of Experts, sparse GP, and full GP in the limiting cases. The presented framework can deal with a general kernel function and multiple variables, and has a time and space complexity which is linear in the number of experts and data samples, which makes our approach highly scalable. We demonstrate superior performance, in a time vs. accuracy sense, of our proposed method against state-of-the-art GP approximations for synthetic as well as several real-world datasets with deterministic and stochastic optimization.
The slowness principle is a concept inspired by the visual cortex of the brain. It postulates that the underlying generative factors of a quickly varying sensory signal change on a different, slower time scale. By applying this principle to state-of-the-art unsupervised representation learning methods, one can learn a latent embedding to perform supervised downstream regression tasks more data-efficiently. In this paper, we compare different approaches to unsupervised slow representation learning, such as $$L_p$$-norm based slowness regularization and the SlowVAE, and propose a new term based on Brownian motion used in our method, the S-VAE. We empirically evaluate these slowness regularization terms with respect to their downstream task performance and data efficiency in state estimation and behavioral cloning tasks. We find that slow representations show great performance improvements in settings where only sparse labeled training data is available. Furthermore, we present a theoretical and empirical comparison of the discussed slowness regularization terms. Finally, we discuss how the Fréchet Inception Distance (FID), commonly used to determine the generative capabilities of GANs, can predict the performance of trained models in supervised downstream tasks.
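A minimal sketch of what an $$L_p$$-norm slowness regularizer computes on a latent trajectory, with the VAE plumbing omitted; the function name and interface are our assumptions:

```python
import numpy as np

def slowness_penalty(z, p=2):
    """L_p slowness regularizer on a latent trajectory z of shape (T, d):
    penalizes large changes between consecutive latent codes so that the
    learned representation varies slowly over time. This is only the
    regularization term; in practice it is added to the model's main loss."""
    diffs = np.diff(z, axis=0)                       # z_{t+1} - z_t
    return float(np.mean(np.sum(np.abs(diffs) ** p, axis=1)))
```

A perfectly constant trajectory incurs zero penalty, while fast-changing codes are penalized in proportion to the p-th power of their step sizes.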
A running example from an example dataset to the result rule set
Nemenyi test on accuracy of algorithms on the first 22 datasets, CD = 2.24, α = 0.05
Influence of data sizes and thread counts on the consumed memory and the runtime
Influence of the m-estimate heuristic on classification accuracy. The x-axis and y-axis indicate the value of the m parameter of the m-estimate heuristic and the classification accuracy, respectively
Classification performance of Lord against the percentage of best rules retained in rule sets produced from medium and big datasets
Conventional rule learning algorithms aim at finding a set of simple rules, where each rule covers as many examples as possible. In this paper, we argue that the rules found in this way may not be the optimal explanations for each of the examples they cover. Instead, we propose an efficient algorithm that aims at finding the best rule covering each training example in a greedy optimization consisting of one specialization and one generalization loop. These locally optimal rules are collected and then filtered into a final rule set, which is much larger than the sets learned by conventional rule learning algorithms. A new example is classified by selecting the best among the rules that cover it. In our experiments on small to very large datasets, the approach’s average classification accuracy is higher than that of state-of-the-art rule learning algorithms. Moreover, the algorithm is highly efficient and can inherently be processed in parallel without affecting the learned rule set, and hence the classification accuracy. We thus believe that it closes an important gap for large-scale classification rule induction.
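The prediction step described above, choosing the best covering rule, can be sketched as follows; the `(covers, label, quality)` triple representation is our illustrative assumption, and the specialization/generalization search that produces the rules is the paper's contribution, not shown:

```python
def best_rule_label(x, rules, default=None):
    """Classify x by the single best rule that covers it. Each rule is a
    (covers, label, quality) triple, where 'covers' is a predicate on x and
    'quality' is a score assigned during rule induction. If no rule covers
    x, fall back to a default (e.g. the majority class)."""
    covering = [r for r in rules if r[0](x)]
    if not covering:
        return default
    return max(covering, key=lambda r: r[2])[1]
```

Because each example is labeled independently by its best covering rule, rules can also be induced for each example independently, which is what makes the approach embarrassingly parallel.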
Logic-based machine learning aims to learn general, interpretable knowledge in a data-efficient manner. However, labelled data must be specified in a structured logical form. To address this limitation, we propose a neural-symbolic learning framework, called Feed-Forward Neural-Symbolic Learner (FFNSL), that integrates a logic-based machine learning system capable of learning from noisy examples, with neural networks, in order to learn interpretable knowledge from labelled unstructured data. We demonstrate the generality of FFNSL on four neural-symbolic classification problems, where different pre-trained neural network models and logic-based machine learning systems are integrated to learn interpretable knowledge from sequences of images. We evaluate the robustness of our framework by using images subject to distributional shifts, for which the pre-trained neural networks may predict incorrectly and with high confidence. We analyse the impact that these shifts have on the accuracy of the learned knowledge and run-time performance, comparing FFNSL to tree-based and pure neural approaches. Our experimental results show that FFNSL outperforms the baselines by learning more accurate and interpretable knowledge with fewer examples.
For many machine-learning tasks, augmenting the data table at hand with features built from external sources is key to improving performance. For instance, estimating housing prices benefits from background information on the location, such as the population density or the average income. However, this information must often be assembled across many tables, requiring time and expertise from the data scientist. Instead, we propose to replace human-crafted features by vectorial representations of entities (e.g. cities) that capture the corresponding information. We represent the relational data on the entities as a graph and adapt graph-embedding methods to create feature vectors for each entity. We show that two technical ingredients are crucial: modeling well the different relationships between entities, and capturing numerical attributes. We adapt knowledge graph embedding methods that were primarily designed for graph completion. Yet, they model only discrete entities, while creating good feature vectors from relational data also requires capturing numerical attributes. For this, we introduce KEN: Knowledge Embedding with Numbers. We thoroughly evaluate approaches to enrich features with background information on 7 prediction tasks. We show that a good embedding model coupled with KEN can perform better than manually handcrafted features, while requiring much less human effort. It is also competitive with combinatorial feature engineering methods, but much more scalable. Our approach can be applied to huge databases, creating general-purpose feature vectors reusable in various downstream tasks.
Preference-based reinforcement learning (PbRL) develops agents using human preferences. Due to its empirical success, it has the prospect of benefiting human-centered applications. Meanwhile, previous work on PbRL overlooks interpretability, which is an indispensable element of ethical artificial intelligence (AI). While prior art for explainable AI offers some machinery, it lacks an approach for selecting the samples from which explanations are constructed. This becomes an issue for PbRL, as transitions relevant to task solving are often outnumbered by irrelevant ones, so ad-hoc sample selection undermines the credibility of explanations. The present study proposes a framework for learning reward functions and state importance from preferences simultaneously. It offers a systematic approach for selecting samples when constructing explanations. Moreover, the present study proposes a perturbation analysis to evaluate the learned state importance quantitatively. Through experiments on discrete and continuous control tasks, the present study demonstrates the proposed framework’s efficacy for providing interpretability without sacrificing task performance.
The two types of instances
Regret of Gaussian instance
Regret of Beta distributed instance
Comparison with the Lagrangian relaxation
We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a ‘primary’ attribute subject to user-provided constraints on other ‘secondary’ attributes. We assume that the attributes can be estimated using samples from the arms’ distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees a logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a boolean flag that correctly identifies, with high probability, whether the given instance is feasible or infeasible with respect to the constraints. We also show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk-constrained maximization of expected return is meaningful.
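One decision step of a constrained-bandit rule in this spirit can be sketched as follows. This is our simplification of the idea (lower confidence bounds for optimistic feasibility, upper confidence bounds for the primary attribute), not the paper's exact Con-LCB algorithm or its confidence radii:

```python
import numpy as np

def select_arm(primary_mean, secondary_mean, counts, t, tau):
    """Among arms whose lower confidence bound on the secondary attribute
    satisfies the constraint (<= tau), play the arm with the highest upper
    confidence bound on the primary attribute. If no arm looks feasible,
    play the least-violating arm and flag the instance as likely infeasible.
    Returns (arm_index, feasible_flag)."""
    radius = np.sqrt(2.0 * np.log(t) / np.maximum(counts, 1))
    lcb_secondary = secondary_mean - radius
    feasible = lcb_secondary <= tau            # optimistic feasibility check
    if not feasible.any():
        return int(np.argmin(lcb_secondary)), False
    ucb_primary = primary_mean + radius
    idx = np.where(feasible)[0]
    return int(idx[np.argmax(ucb_primary[idx])]), True
```

The same feasibility flag that drives exploration doubles as the algorithm's high-probability verdict on whether the instance satisfies the constraints at all.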
Class imbalance occurs when the class distribution is unequal: one class is under-represented (the minority class), while the other has significantly more samples in the data (the majority class). The class imbalance problem is prevalent in many real-world applications, and generally the under-represented minority class is the class of interest. The synthetic minority over-sampling technique (SMOTE) is considered the most prominent method for handling unbalanced data. SMOTE generates new synthetic data patterns by performing linear interpolation between minority class samples and their K nearest neighbors. However, the SMOTE-generated patterns do not necessarily conform to the original minority class distribution. This paper develops a novel theoretical analysis of the SMOTE method by deriving the probability distribution of the SMOTE-generated samples. To the best of our knowledge, this is the first work deriving a mathematical formulation for the probability distribution of SMOTE patterns. This allows us to compare the density of the generated samples with the true underlying class-conditional density, in order to assess how representative the generated samples are. The derived formula is verified by comparing it against empirically estimated densities for a number of distributions.
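The interpolation step being analyzed is simple enough to state in code. This is a minimal sketch of classic SMOTE generation (brute-force neighbor search, our function name), not the paper's derivation:

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=10, rng=None):
    """Generate n_new synthetic minority samples: pick a random minority
    point, pick one of its k nearest minority neighbors, and interpolate
    linearly between them with a uniform random weight in [0, 1]."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances within the minority class (brute force)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbors per point
    out = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        a = rng.integers(n)                  # base minority sample
        b = nn[a, rng.integers(k)]           # one of its k neighbors
        t = rng.random()                     # interpolation weight
        out[i] = X_min[a] + t * (X_min[b] - X_min[a])
    return out
```

Because every synthetic point lies on a segment between two minority samples, the generated distribution is confined to those segments, which is precisely why it need not match the true class-conditional density.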
In federated learning, client models are often trained on local training sets that vary in size and distribution. Such statistical heterogeneity in training data leads to performance variations across local models. Even within a model, some parameter estimates can be more reliable than others. Most existing FL approaches (such as FedAvg), however, do not explicitly address such variations in client parameter estimates and treat all local parameters with equal importance in the model aggregation. This disregard of varying evidential credence among client models often leads to slow convergence and a sensitive global model. We address this gap by proposing an aggregation mechanism based upon the Hessian matrix. Further, by making use of the first-order information of the loss function, we can use the Hessian as a scaling matrix in a manner akin to that employed in Quasi-Newton methods. This treatment captures the impact of data quality variations across local models. Experiments show that our method is superior to the baselines of Federated Average (FedAvg), FedProx, Federated Curvature (FedCurv) and Federated Newton Learn (FedNL) for image classification on MNIST, Fashion-MNIST, and CIFAR-10 datasets when the client models are trained using statistically heterogeneous data.
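The aggregation idea can be sketched with a diagonal Hessian approximation: coordinates estimated with higher curvature (more evidential credence) receive more weight in the global average. This is our simplified illustration of curvature-weighted aggregation, not the paper's exact update:

```python
import numpy as np

def hessian_weighted_average(params, hess_diags):
    """Aggregate client parameter vectors coordinate-wise, weighting each
    client's estimate by its (diagonal) Hessian value, so that parameters
    backed by more curvature information dominate the global model.
    With all-ones weights this reduces to plain FedAvg-style averaging."""
    params = np.asarray(params, dtype=float)     # (n_clients, n_params)
    H = np.asarray(hess_diags, dtype=float)      # (n_clients, n_params), > 0
    return (H * params).sum(axis=0) / H.sum(axis=0)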
In artificial multi-agent systems, the ability to learn collaborative policies is predicated upon the agents’ communication skills: they must be able to encode the information received from the environment and learn how to share it with other agents as required by the task at hand. We present a deep reinforcement learning approach, Connectivity Driven Communication (CDC), that facilitates the emergence of multi-agent collaborative behaviour only through experience. The agents are modelled as nodes of a weighted graph whose state-dependent edges encode pair-wise messages that can be exchanged. We introduce a graph-dependent attention mechanism that controls how the agents’ incoming messages are weighted. This mechanism takes into full account the current state of the system as represented by the graph, and builds upon a diffusion process that captures how the information flows on the graph. The graph topology is not assumed to be known a priori, but depends dynamically on the agents’ observations, and is learnt concurrently with the attention mechanism and policy in an end-to-end fashion. Our empirical results show that CDC is able to learn effective collaborative policies and can outperform competing learning algorithms on cooperative navigation tasks.
In this paper we address imbalanced binary classification (IBC) tasks. Applying resampling strategies to balance the class distribution of training instances is a common approach to tackle these problems. Many state-of-the-art methods find instances of interest close to the decision boundary to drive the resampling process. However, under-sampling the majority class may potentially lead to important information loss. Over-sampling also may increase the chance of overfitting by propagating the information contained in instances from the minority class. The main contribution of our work is a new method called ICLL for tackling IBC tasks which is not based on resampling training observations. Instead, ICLL follows a layered learning paradigm to model the data in two stages. In the first layer, ICLL learns to distinguish cases close to the decision boundary from cases which are clearly from the majority class, where this dichotomy is defined using a hierarchical clustering analysis. In the subsequent layer, we use instances close to the decision boundary and instances from the minority class to solve the original predictive task. A second contribution of our work is the automatic definition of the layers which comprise the layered learning strategy using a hierarchical clustering model. This is a relevant discovery as this process is usually performed manually according to domain knowledge. We carried out extensive experiments using 100 benchmark data sets. The results show that the proposed method leads to better performance relative to several state-of-the-art methods for IBC.
Deep learning has recently enabled machine learning (ML) to make unparalleled strides. It did so by confronting and successfully addressing, at least to a certain extent, the knowledge bottleneck that paralyzed ML and artificial intelligence for decades. The community is currently basking in deep learning’s success, but a question that comes to mind is: have all of the issues previously affecting machine learning systems been solved by deep learning, or do some issues remain for which deep learning is not a bulletproof solution? This question, in the context of class imbalance, is the motivation for this paper. The class imbalance problem was first recognized almost three decades ago and has remained a critical challenge, at least for traditional learning approaches. Our goal is to investigate whether the tight dependency between class imbalance, concept complexity, dataset size and classifier performance, known to exist in traditional learning systems, is alleviated in any way in deep learning approaches, and to what extent, if any, network depth and regularization can help. To answer these questions we conduct a survey of the recent literature focused on deep learning and the class imbalance problem, as well as a series of controlled experiments on both artificial and real-world domains. This allows us to formulate lessons learned about the impact of class imbalance on deep learning models, as well as pose open challenges that should be tackled by researchers in this field.
One notable weakness of current machine learning algorithms is the poor ability of models to solve new problems without forgetting previously acquired knowledge. The Continual Learning paradigm has emerged as a protocol to systematically investigate settings where the model sequentially observes samples generated by a series of tasks. In this work, we take a task-agnostic view of continual learning and develop a hierarchical information-theoretic optimality principle that facilitates a trade-off between learning and forgetting. We derive this principle from a Bayesian perspective and show its connections to previous approaches to continual learning. Based on this principle, we propose a neural network layer, called the Mixture-of-Variational-Experts layer, that alleviates forgetting by creating a set of information processing paths through the network which is governed by a gating policy. Equipped with a diverse and specialized set of parameters, each path can be regarded as a distinct sub-network that learns to solve tasks. To improve expert allocation, we introduce diversity objectives, which we evaluate in additional ablation studies. Importantly, our approach can operate in a task-agnostic way, i.e., it does not require task-specific knowledge, as is the case with many existing continual learning algorithms. Due to the general formulation based on generic utility functions, we can apply this optimality principle to a large variety of learning problems, including supervised learning, reinforcement learning, and generative modeling. We demonstrate the competitive performance of our method on continual reinforcement learning and variants of the MNIST, CIFAR-10, and CIFAR-100 datasets.
Despite the vast success of standard planar convolutional neural networks, they are not the most efficient choice for analyzing signals that lie on an arbitrarily curved manifold, such as a cylinder. The problem arises when one performs a planar projection of these signals, which inevitably distorts or breaks them where there is valuable information. We propose a Circular-symmetric Correlation Layer (CCL) based on the formalism of roto-translation equivariant correlation on the continuous group \(S^1 \times \mathbb {R}\), and implement it efficiently using the well-known Fast Fourier Transform (FFT) algorithm. We showcase the performance analysis of a general network equipped with CCL on various recognition and classification tasks and datasets.
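The FFT trick behind such a layer rests on a standard identity: circular cross-correlation in the signal domain becomes a pointwise product in the frequency domain. A minimal 1-D sketch (the layer itself generalizes this to 2-D, learnable filters):

```python
import numpy as np

def circular_correlation(x, w):
    """Circular cross-correlation of two real 1-D signals of length N:
    c[n] = sum_m x[m] * w[(m + n) mod N], computed in O(N log N) via the
    identity FFT(c) = conj(FFT(x)) * FFT(w)."""
    X = np.fft.fft(x)
    W = np.fft.fft(w)
    return np.real(np.fft.ifft(np.conj(X) * W))
```

The periodicity that the FFT assumes is exactly the circular symmetry of the cylinder's angular coordinate, so no padding or boundary distortion is introduced along that axis.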
Graph Neural Architecture Search (Graph-NAS) methods have shown great potential in finding better graph neural network designs than handcrafted ones. However, existing Graph-NAS frameworks are based on complex algorithms and fail to keep costs low while achieving high scalability and performance. They require full training of thousands of graph neural networks to inform the search process, resulting in a prohibitive computational cost, which is not necessarily affordable for interested users. Because of this cost, many researchers have limited the search-space exploration ability, which may lead to locally optimal solutions. In this paper, we propose a performance predictor-based graph neural architecture search (PGNAS) framework. The proposed approach consists of three conceptually simple phases and can broadly explore a search space at a much lower computational cost. We train n sampled architectures from a search space to generate n (architecture, validation accuracy) pairs used to train a performance distribution learner, where the features are the architecture description and the validation accuracy is the target. Next, we use this performance distribution learner to predict the validation accuracies of architectures in the search space. Finally, we train the top-K predicted architectures and choose the architecture with the best validation result. Although our approach seems simple, it is efficient and scalable; experimental results show that PGNAS outperforms both existing handcrafted and Graph-NAS models on four benchmark datasets.
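The second and third phases reduce to fitting a predictor on the n pairs and ranking candidates. A sketch with a plain least-squares predictor standing in for the performance distribution learner (the linear model and names are our simplification):

```python
import numpy as np

def top_k_by_predictor(train_feats, train_accs, cand_feats, k=3):
    """Fit a simple linear performance predictor on
    (architecture-feature, validation-accuracy) pairs, score every
    candidate architecture, and return the indices of the top-k predicted
    performers, which would then be fully trained and compared."""
    A = np.c_[train_feats, np.ones(len(train_feats))]        # add bias term
    w, *_ = np.linalg.lstsq(A, train_accs, rcond=None)
    scores = np.c_[cand_feats, np.ones(len(cand_feats))] @ w
    return np.argsort(scores)[::-1][:k]
```

Only the k shortlisted architectures pay the full-training cost; the rest of the search space is screened at the cost of a prediction.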
Geometric analog of the generalized triangle inequality: the total area of any three faces in a tetrahedron is greater than that of the fourth face
Sample space \(\Omega\), mass functions \(\{\varvec{p}^{i}\}_{i=1}^{4}\), and cost function \(d\) that lead to a violation of (5). The red triangle is \(\varvec{p}^1\), the blue star is \(\varvec{p}^2\), the green squares are \(\varvec{p}^3\), and the yellow circles are \(\varvec{p}^4\)
Comparing the effect that different distances and metrics have on clustering synthetic graphs. Histogram with dashed outline is for non-n-metric, with solid outline for barycenter, and with dotted outline for pairwise MMOT
Comparing the effect that different distances and metrics have on clustering molecular graphs. Histogram with dashed outline is for non-n-metric, with solid outline for barycenter, and with dotted outline for pairwise MMOT
Hypergraphs are built by comparing molecules' shapes using MMOT and creating hyperedges for groups of similarly shaped molecules. With a non-generalized metric, the gains from using a richer hypergraph (more hyperedges) are lost due to anomalies allowed by the lack of metricity (flat green and red curves); not so with a generalized metric (negatively sloped curves). The generalized metric studied here (pairwise) leads to better clustering than previously studied metrics (barycenter). Error bars are the standard deviation of the average over 100 independent experiments
The Optimal transport (OT) problem is rapidly finding its way into machine learning. Favoring its use are its metric properties. Many problems admit solutions with guarantees only for objects embedded in metric spaces, and the use of non-metrics can complicate solving them. Multi-marginal OT (MMOT) generalizes OT to simultaneously transporting multiple distributions. It captures important relations that are missed if the transport only involves two distributions. Research on MMOT, however, has been focused on its existence, uniqueness, practical algorithms, and the choice of cost functions. There is a lack of discussion on the metric properties of MMOT, which limits its theoretical and practical use. Here, we prove new generalized metric properties for a family of pairwise MMOTs. We first explain the difficulty of proving this via two negative results. Afterward, we prove the MMOTs’ metric properties. Finally, we show that the generalized triangle inequality of this family of MMOTs cannot be improved. We illustrate the superiority of our MMOTs over other generalized metrics, and over non-metrics in both synthetic and real tasks.
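As a rough numerical illustration of the pairwise idea, one can form a generalized distance over several marginals by summing pairwise 1-D Wasserstein distances (our own surrogate; the paper's MMOT solves a joint multi-marginal transport problem, not independent pairwise ones):

```python
import itertools
import numpy as np
from scipy.stats import wasserstein_distance

def pairwise_cost(samples):
    # Sum of pairwise 1-D Wasserstein-1 distances over all marginals --
    # a cheap stand-in for a pairwise multi-marginal transport cost.
    return sum(wasserstein_distance(a, b)
               for a, b in itertools.combinations(samples, 2))

rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, size=200) for m in (0.0, 0.5, 1.0, 3.0)]

# Analogue of the tetrahedron picture above: the cost of any triple of
# marginals is bounded by the sum of the costs of the remaining three triples.
triples = list(itertools.combinations(range(4), 3))
costs = {t: pairwise_cost([groups[i] for i in t]) for t in triples}
```

For this surrogate the generalized triangle inequality holds because every pairwise term in one triple reappears in the other triples; the paper's contribution is proving analogous (and tight) bounds for genuine MMOTs.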
Two signed multi-relational graphs. The blue edges correspond to the ‘friends’ relation and the red edges to the ‘acquaintances’ relation (Color figure online)
a shows a relational structure with 2 unary predicates and 2 binary relations. b the projection of the unary relation. c the projection of the binary relation. d the possible values for the unary relation. e the possible values for the binary relation. The circles represent the elements of the domain, boxes attached to the nodes indicate the values of the unary relations, and edges (with arrows) represent the binary relations
Graphons are limits of large graphs. Motivated by a theoretical problem from statistical relational learning, we generalize basic results from graphon theory to the “multi-relational” setting. We show that the multi-relational counterparts, which we call multi-relational graphons, are analogously the limits of large multi-relational graphs. We extend the cut-distance topology for graphons to multi-relational graphons and prove its compactness and the density of multi-relational graphs in this topology. Compactness, in turn, enables us to prove a large deviation principle (LDP) for multi-relational graphs, which shows that the most typical random graphs constrained by marginal statistics converge asymptotically to constrained multi-relational graphons with maximum entropy. Finally, we show the equivalence between a restricted version of Markov Logic Networks and multi-relational graphons with maximum entropy.
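For reference, the classical cut distance being extended can be written as follows; the multi-relational variant shown afterward is one natural extension we sketch for intuition, not necessarily the paper's exact definition:

```latex
\|W\|_{\square} \;=\; \sup_{S,T \subseteq [0,1]}
  \left| \int_{S \times T} W(x,y)\,\mathrm{d}x\,\mathrm{d}y \right|,
\qquad
\delta_{\square}(W,U) \;=\; \inf_{\varphi} \,\|W - U^{\varphi}\|_{\square},
```

where \(\varphi\) ranges over measure-preserving bijections of \([0,1]\) and \(U^{\varphi}(x,y) = U(\varphi(x),\varphi(y))\). A multi-relational graphon is then a tuple \((W_r)_{r \in R}\) with one kernel per relation, and a natural distance aligns all relations with a single common rearrangement, e.g. \(\inf_{\varphi} \max_{r} \|W_r - U_r^{\varphi}\|_{\square}\).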
The collective graphical model (CGM) is a probabilistic model that provides a framework for analyzing aggregated count data. Maximum a posteriori (MAP) inference of unobserved variables under given observations is one of the essential operations in CGM. Because the MAP inference problem is known to be NP-hard in general, the current mainstream approach is to solve an alternative problem obtained by approximating the objective function and applying continuous relaxation. However, this approach has two significant drawbacks. First, the quality of the solution deteriorates when the values in the count data are small, due to the inaccuracy of Stirling's approximation. Second, the application of continuous relaxation causes the violation of integrality constraints. This paper proposes novel algorithms for MAP inference in CGMs on path graphs to overcome these problems. Our method is based on the discrete difference of convex algorithm (DCA); DCA is a general framework to minimize the sum of a convex function and a concave function by repeatedly minimizing surrogate functions. Utilizing the particular structure of path graphs, we efficiently solve the surrogate function minimization by minimum convex cost flow algorithms. Furthermore, our approach also leads to a new method for solving another important task: MAP inference of the sample size in CGMs on path graphs. Our method is naturally applicable to this task, allowing us to design very efficient algorithms. Experimental results on synthetic and real-world datasets show the effectiveness of the proposed algorithms.
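The Stirling-related drawback can be seen directly: the surrogate \(n \log n - n\) typically used by continuous relaxations of \(\log n!\) is badly off exactly where the counts are small:

```python
import math

def stirling_log_factorial(n):
    """The n*log(n) - n form of Stirling's approximation, as commonly
    substituted for log n! in continuous relaxations."""
    return n * math.log(n) - n if n > 0 else 0.0

# The error matters exactly where CGM counts are small.
for n in (1, 2, 5, 100, 10000):
    exact = math.lgamma(n + 1)          # log n!
    approx = stirling_log_factorial(n)
    print(n, exact, approx)
```

At \(n = 1\) the approximation gives \(-1\) against a true value of \(0\), while for \(n = 10{,}000\) the relative error is below \(10^{-4}\), which is why the approximation-based approach degrades on sparse counts.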
Federated Learning (FL) aims to generate a global shared model by collaborating with decentralized edge computing devices under privacy considerations. A significant challenge in FL is non-IID data partitioning across heterogeneous devices, which causes local updates to diverge substantially and makes them difficult to aggregate. This diversity in local models stems from differing posterior probabilities of samples when class distributions are skewed. Meanwhile, FL often faces imbalanced global data in practical scenarios. By analyzing the relationship between samples' posterior probabilities under different data distributions, we propose a statistically principled probability-corrected loss that aligns the posterior probabilities when models are trained on heterogeneous clients. Additionally, we share fixed prototypes with each client to constrain the distribution of the heterogeneous clients' features. Our approach handles non-IID FL well with both balanced and imbalanced global data. We combine our approach with existing FL algorithms and evaluate it on common FL benchmarks. Extensive experimental results verify the superiority of our method.
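One common form such a posterior correction can take is logit adjustment by log class priors; the sketch below illustrates that general idea and is not necessarily the paper's exact loss:

```python
import numpy as np

def prior_corrected_log_loss(logits, labels, local_prior, global_prior):
    """Cross-entropy with logits shifted by log(local/global class prior),
    so posteriors implied by a client's skewed class distribution line up
    with the global one (an illustration, not the paper's exact loss)."""
    adjusted = logits + np.log(local_prior) - np.log(global_prior)
    adjusted = adjusted - adjusted.max(axis=1, keepdims=True)   # stable softmax
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# A client where class 0 dominates locally but not globally
logits = np.array([[2.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 1])
loss = prior_corrected_log_loss(logits, labels,
                                local_prior=np.array([0.9, 0.1]),
                                global_prior=np.array([0.5, 0.5]))
```

With uniform local and global priors the correction vanishes and the loss reduces to plain cross-entropy.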
Machine learning (ML) is extensively used in fields involving sensitive data, so data holders seek to protect the privacy of their data while building ML models of high utility. Differential privacy provides a feasible solution, but under it privacy and utility mutually constrain each other. How to improve (or even maximize) the utility of the model while preserving privacy therefore becomes an urgent problem. To address it, we apply the unscented Kalman filter (UKF) to various implementations of differential-privacy-enabled ML (DPML). We propose a UKF-based DP-enabled ML (UKF-DPML) framework that achieves higher model utility for a given privacy budget \(\epsilon\). An evaluation module is included in the framework to ensure a fair assessment of DPML models. We validate the effectiveness of this framework through mathematical reasoning, followed by empirical evaluation of various UKF-DPML and DPML implementations. In the evaluation, we measure the ability to withstand real-world privacy attacks and to provide accurate classification, thus assessing both the privacy and the utility of the model. We evaluate a range of privacy budgets and implementations on three datasets, each configuration providing the same mathematical privacy guarantee. By measuring the resistance of UKF-DPML and DPML models to membership and attribute inference attacks, together with their classification accuracy, we find that applying the UKF to aggregates perturbed by DP noise yields higher utility at the same privacy budget, and that the degree of improvement depends on the stage at which the UKF is applied.
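The idea of filtering DP-perturbed aggregates can be illustrated with a plain 1-D linear Kalman filter standing in for the UKF (our deliberate simplification; the noise scales and dynamics below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# A slowly varying true aggregate, observed through Laplace noise
# (epsilon-DP for a count query of sensitivity 1).
epsilon = 0.5
true = np.cumsum(rng.normal(0, 0.1, 200)) + 50.0
noisy = true + rng.laplace(0, 1.0 / epsilon, 200)

x, P = noisy[0], 10.0                 # state estimate and its variance
Q = 0.1 ** 2                          # process noise variance (drift step)
R = 2.0 / epsilon ** 2                # Laplace(b) variance = 2 b^2
filtered = []
for z in noisy:
    P = P + Q                         # predict
    K = P / (P + R)                   # Kalman gain
    x = x + K * (z - x)               # update with the DP-noised observation
    P = (1 - K) * P
    filtered.append(x)
filtered = np.array(filtered)
```

Because the filter only post-processes already-perturbed values, the differential privacy guarantee is untouched while the estimation error shrinks, which is the intuition behind the reported utility gains.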
Automated Machine Learning (AutoML) has been used successfully in settings where the learning task is assumed to be static. In many real-world scenarios, however, the data distribution will evolve over time, and it is yet to be shown whether AutoML techniques can effectively design online pipelines in dynamic environments. This study aims to automate pipeline design for online learning while continuously adapting to data drift. For this purpose, we design an adaptive Online Automated Machine Learning (OAML) system, searching the complete pipeline configuration space of online learners, including preprocessing algorithms and ensembling techniques. This system combines the inherent adaptation capabilities of online learners with fast automated pipeline (re)optimization. Focusing on optimization techniques that can adapt to evolving objectives, we evaluate asynchronous genetic programming and asynchronous successive halving to optimize these pipelines continually. We experiment on real and artificial data streams with varying types of concept drift to test the performance and adaptation capabilities of the proposed system. The results confirm the utility of OAML over popular online learning algorithms and underscore the benefits of continuous pipeline redesign in the presence of data drift.
Understanding customer behavior is necessary to develop efficient marketing strategies or launch tailored programs with social value for the public. Customer segmentation is a critical task for understanding diverse and dynamic customer behavior. However, as the popularity of different products varies, building dynamic customer behavior models for products with few customers may overfit the data. In this paper, we propose a new Bayesian nonparametric model for dynamic customer segmentation—Hierarchical Fragmentation-Coagulation Processes (HFCP), which allows sharing behavior patterns across multiple products. We conduct comprehensive empirical evaluations using two real-world purchase datasets. Our results show that HFCP can: (i) determine the number of groups required to model diverse customer behavior automatically; (ii) capture the changes such as split and merge of customer groups over time; (iii) discover behavior patterns shared among products and identify products with similar or different purchase behavior impacted by promotion, brand choice and change of seasons; and (iv) overcome overfitting problems and outperform previous customer segmentation models on estimating behavior for unseen customers. Hence, HFCP is a flexible and accurate segmentation model that can be used by stakeholders to understand dynamic customer behavior and compare the purchase behavior for different products.
Top-cited authors
Holger H Hoos
  • RWTH Aachen University
Jesper E. van Engelen
Bernhard Pfahringer
  • The University of Waikato
Willem Waegeman
  • Ghent University
Albert Bifet
  • Télécom ParisTech