Data Mining and Knowledge Discovery

Published by Springer Nature
Online ISSN: 1573-756X
Recent publications
Left: Decision surface and hierarchy of a classifier for Ann’s social context and five concepts: “Person”, “Boss”, “Subordinate”, “Dave”, and “Earl”. Middle: Individual Concept Drift: Dave moves to a different office, so the decision surface changes but the hierarchy remains the same. Right: Knowledge Drift: Ann is promoted, “Dave” is now her subordinate. If the classifier knows that the hierarchy changed, it can transfer examples from “Dave” to “Subordinate”, quickly improving its performance
Comparison in terms of micro $F_1$ between trckd and standard forgetting strategies for kNN-based methods. Top to bottom: results for H-STAGGER, H-EMNIST and H-20NG. Left to right: concept removal, relation addition, and relation removal. Error bars indicate std. error
Comparison in terms of micro $F_1$ between trckd and less interactive variants. Top to bottom: results for H-STAGGER, H-EMNIST and H-20NG. Left to right: concept drift, concept removal, relation addition, and relation removal. Error bars indicate std. error
trckd versus competitors on sequential KD in terms of micro $F_1$. Left: H-STAGGER. Middle: H-EMNIST. Right: H-20NG. Error bars indicate std. error
Micro $F_1$ results on H-STAGGER, automatic drift type identification with Graphical Lasso and MMD for detection versus trckd. Left: relation addition. Right: relation removal
We introduce and study knowledge drift (KD), a special form of concept drift that occurs in hierarchical classification. Under KD the vocabulary of concepts, their individual distributions, and the is-a relations between them can all change over time. The main challenge is that, since the ground-truth concept hierarchy is unobserved, it is hard to tell apart different forms of KD. For instance, the introduction of a new is-a relation between two concepts might be confused with changes to those individual concepts, but it is far from equivalent. Failure to identify the right kind of KD compromises the concept hierarchy used by the classifier, leading to systematic prediction errors. Our key observation is that in human-in-the-loop applications like smart personal assistants the user knows what kind of drift occurred recently, if any. Motivated by this observation, we introduce trckd, a novel approach that combines two automated stages—drift detection and adaptation—with a new interactive disambiguation stage in which the user is asked to refine the machine’s understanding of recently detected KD. In addition, trckd implements a simple but effective knowledge-aware adaptation strategy. Our simulations show that, when the structure of the concept hierarchy drifts, a handful of queries to the user are often enough to substantially improve prediction performance on both synthetic and realistic data.
The identification of relevant features, i.e., the driving variables that determine a process or the properties of a system, is an essential part of the analysis of data sets with a large number of variables. A mathematically rigorous approach to quantifying the relevance of these features is mutual information, which determines the relevance of features in terms of their joint mutual dependence with the property of interest. However, mutual information requires probability distributions as input, and these cannot be reliably estimated for continuous quantities such as lengths or energies. Here, we introduce total cumulative mutual information (TCMI), a measure of the relevance of mutual dependences that extends mutual information to continuously distributed random variables based on cumulative probability distributions. TCMI is a non-parametric, robust, and deterministic measure that facilitates comparisons and rankings between feature sets of different cardinality. The ranking induced by TCMI allows for feature selection, i.e., the identification of variable sets that are nonlinearly statistically related to a property of interest, taking into account the number of data samples as well as the cardinality of the set of variables. We evaluate the performance of our measure on simulated data, compare it with similar multivariate-dependence measures, and demonstrate the effectiveness of our feature-selection method on a set of standard data sets and a typical scenario in materials science.
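The underlying notion of MI-based feature relevance can be illustrated with a plain histogram estimator. Note that this kind of naive binning is exactly the weakness of classic mutual information on continuous data that TCMI is designed to avoid; the function below is our own sketch, not the paper's method, and all names are ours.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X;Y) in nats (a crude sketch, not TCMI)."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)      # marginal of X
    py = pxy.sum(axis=0, keepdims=True)      # marginal of Y
    nz = pxy > 0                             # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
noise = rng.normal(size=5000)
y = x**2 + 0.1 * noise                       # y depends on x, not on noise
mi_relevant = mutual_information(x, y)       # driving variable: high MI
mi_irrelevant = mutual_information(noise, y) # irrelevant variable: near zero
```

Note that the relevant feature here is *nonlinearly* related to the target (y = x²), which a linear correlation would miss entirely; capturing such relations is the motivation for MI-based relevance measures.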
Sample images of four image datasets
Parameter analysis of the proposed method on the Extended Yale B dataset
From left to right: the columns show the results of BDR (ACC = 0.652), ECMSC (ACC = 0.812) and CBDMSC (ACC = 0.969), respectively. From top to bottom: the rows show the visualization of the affinity matrices and $\mathbf{F}\mathbf{F}^T$, respectively. For ease of visualization, we show the absolute value of the matrices and rescale each entry by a factor of 800
The object of multi-view subspace clustering is to uncover the latent low-dimensional structure by segmenting a collection of high-dimensional multi-source data into their corresponding subspaces. Existing methods impose various constraints on the affinity matrix and/or the cluster labels to promote segmentation accuracy, and have demonstrated effectiveness in some applications. However, these constraints are insufficient to ensure the ideal discriminative capability of the corresponding method. In this paper, we propose to learn view-specific affinity matrices and a common cluster indicator matrix jointly in a unified minimization problem, in which the affinity matrices and the cluster indicator matrix can guide each other to facilitate the final segmentation. To enforce the ideal discrimination, we use a block-diagonal-inducing regularizer to constrain the affinity matrices as well as the cluster indicator matrix. Such coupled regularizers act as double insurance to promote clustering accuracy. We call the method Coupled Block Diagonal Regularized Multi-view Subspace Clustering (CBDMSC). Based on alternating minimization, an algorithm is proposed to solve the new model. We evaluate our method by several metrics and compare it with several state-of-the-art related methods on some commonly used datasets. The results demonstrate that our method outperforms the state-of-the-art methods in the vast majority of metrics.
Illustration of the interactions between a user and the recommender system: the recommender system iteratively uses its model to recommend slates (blue) to the user. The user, in turn, provides feedback in the form of clicks (green) (Color figure online)
Click rates per model variant and day in the online experiment. The shaded area around each line is the daily 68% confidence interval (one standard deviation) of the normal approximation to Bernoulli trials (Color figure online)
Experiment showing the performance of Slate-gru-hier-greedy and all-item-gru-hier-greedy before, during and after the loss of 25% of the search data on the marketplace platform. It also displays the period with the corresponding bug (in grey), during which measurements are faulty and should not be considered, but are given for completeness. The figure shows that the slate likelihood model performs better than the all-item likelihood model when the dataset consists of slates $a_t^u$ that are less relevant to the user $u$ (Color figure online)
Two-dimensional histogram of all items in Slate-gru-hier-greedy: number of views vs. the difference between the item mean vector and the item's corresponding group mean vector, $||E_{\phi}(v_i) - E_{\phi}(\mu_{g(i)}^g)||_2$
Two-dimensional histogram of all items in Slate-gru-hier-greedy: the number of views vs. the estimated posterior scale parameter $\sigma_{\theta}$. We see that items with fewer views have wider posterior distributions. Darker colour means that more items fall in that histogram bin
We consider the problem of recommending relevant content to users of an internet platform in the form of lists of items, called slates. We introduce a variational Bayesian Recurrent Neural Net recommender system that acts on time series of interactions between the internet platform and the user, and which scales to real world industrial situations. The recommender system is tested both online on real users, and on an offline dataset collected from a Norwegian web-based marketplace, which is made public for research. This is one of the first publicly available datasets that includes all the slates presented to users as well as which items (if any) in the slates were clicked. Such a dataset allows us to move beyond the common implicit assumption that users consider all possible items at each interaction. Instead we build our likelihood using the items that are actually in the slate, and evaluate the strengths and weaknesses of both approaches theoretically and in experiments. We also introduce a hierarchical prior for the item parameters based on group memberships. Both item parameters and user preferences are learned probabilistically. Furthermore, we combine our model with bandit strategies to ensure learning, and introduce ‘in-slate Thompson sampling’ which makes use of the slates to maximise explorative opportunities. We show experimentally that explorative recommender strategies perform on par or above their greedy counterparts. Even without making use of exploration to learn more effectively, click rates increase simply because of improved diversity in the recommended slates.
The graph edit distance is an intuitive measure to quantify the dissimilarity of graphs, but its computation is NP-hard and challenging in practice. We introduce methods for answering nearest neighbor and range queries regarding this distance efficiently for large databases with up to millions of graphs. We build on the filter-verification paradigm, where lower and upper bounds are used to reduce the number of exact computations of the graph edit distance. Highly effective bounds for this involve solving a linear assignment problem for each graph in the database, which is prohibitive in massive datasets. Index-based approaches typically provide only weak bounds, leading to high verification costs. In this work, we derive novel lower bounds for efficient filtering from restricted assignment problems, where the cost function is a tree metric. This special case allows embedding the costs of optimal assignments isometrically into $\ell_1$ space, rendering efficient indexing possible. We propose several lower bounds of the graph edit distance obtained from tree metrics reflecting the edit costs, which are combined for effective filtering. Our method, termed EmbAssi, can be integrated into existing filter-verification pipelines as a fast and effective pre-filtering step. Empirically we show that for many real-world graphs our lower bounds are already close to the exact graph edit distance, while our index construction and search scales to very large databases.
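The filter-verification paradigm the abstract builds on can be sketched in a few lines. The lower bound below (node- and edge-count differences, valid for uniform edit costs) is far weaker than EmbAssi's tree-metric assignment bounds, but it plays the same role in the pipeline; graphs are reduced to toy (n_nodes, n_edges) pairs, and all names are our own.

```python
def size_lower_bound(g1, g2):
    # A cheap, valid lower bound on uniform-cost graph edit distance:
    # each surplus node or edge needs at least one edit operation.
    # Graphs are toy (n_nodes, n_edges) pairs in this sketch.
    (n1, m1), (n2, m2) = g1, g2
    return abs(n1 - n2) + abs(m1 - m2)

def range_query(db, query, tau, exact_ged):
    # Filter-verification: call the expensive exact_ged only when the
    # lower bound does not already exceed the threshold tau.
    hits, verified = [], 0
    for gid, g in db.items():
        if size_lower_bound(query, g) > tau:
            continue                      # filtered out, no exact computation
        verified += 1
        if exact_ged(query, g) <= tau:
            hits.append(gid)
    return hits, verified

db = {"g1": (5, 4), "g2": (20, 30), "g3": (6, 5)}
query = (5, 5)
# Toy stand-in for an exact GED solver; in reality this is the costly step.
hits, verified = range_query(db, query, tau=2, exact_ged=size_lower_bound)
# g2 is pruned by the bound alone, so only g1 and g3 reach verification.
```

The closer the lower bound is to the exact distance (the paper's empirical point), the fewer candidates survive filtering and the fewer exact computations are paid for.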
Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The Minimum Description Length (MDL) principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim to obtain compact high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, we review MDL-based methods for mining different kinds of patterns from various types of data. Finally, we open a discussion on some issues regarding these methods.
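The trade-off at the heart of MDL-based pattern selection, L(model) + L(data | model), can be made concrete with a toy two-part score in the spirit of Krimp-style methods. Real encodings use optimal prefix codes; this uniform-cost version (entirely our own construction) only shows the quantity being minimised.

```python
def description_length(patterns, transactions):
    # Toy two-part MDL score: L(model) counts the items needed to write
    # the patterns down; L(data|model) counts, per transaction, one code
    # per matching (non-overlapping) pattern plus one code per leftover
    # item. Uniform code costs -- a simplification of real MDL methods.
    model_cost = sum(len(p) for p in patterns)
    data_cost = 0
    for t in transactions:
        covered = set()
        for p in patterns:
            if p <= t and not (p & covered):
                covered |= p
                data_cost += 1          # one code word for the whole pattern
        data_cost += len(t - covered)   # singleton codes for the rest
    return model_cost + data_cost

transactions = [frozenset("abc"), frozenset("abc"), frozenset("abcd")]
no_patterns = description_length([], transactions)
with_pattern = description_length([frozenset("abc")], transactions)
# The frequent pattern {a,b,c} compresses the data, so MDL prefers it.
```

A pattern is worth keeping exactly when the model bits it adds are outweighed by the data bits it saves, which is how MDL yields compact, non-redundant pattern sets without a user-set threshold.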
Interpretable machine learning has become a very active area of research due to the rising popularity of machine learning algorithms and their inherently challenging interpretability. Most work in this area has been focused on the interpretation of single features in a model. However, for researchers and practitioners, it is often equally important to quantify the importance or visualize the effect of feature groups. To address this research gap, we provide a comprehensive overview of how existing model-agnostic techniques can be defined for feature groups to assess the grouped feature importance, focusing on permutation-based, refitting, and Shapley-based methods. We also introduce an importance-based sequential procedure that identifies a stable and well-performing combination of features in the grouped feature space. Furthermore, we introduce the combined features effect plot, which is a technique to visualize the effect of a group of features based on a sparse, interpretable linear combination of features. We used simulation studies and real data examples to analyze, compare, and discuss these methods.
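Of the three families the paper surveys, the permutation-based one is the simplest to sketch: jointly shuffle all columns of a group (preserving their within-group dependence) and measure the increase in loss. This minimal version is our own illustration, not the paper's implementation.

```python
import numpy as np

def grouped_permutation_importance(predict, X, y, group, n_repeats=20, seed=0):
    # Permutation importance for a *group* of columns: the group's
    # columns are shuffled together (same row permutation), so their
    # joint dependence is preserved while their link to y is broken.
    rng = np.random.default_rng(seed)
    base = np.mean((y - predict(X)) ** 2)
    drops = []
    for _ in range(n_repeats):
        Xp = X.copy()
        perm = rng.permutation(len(X))
        Xp[:, group] = Xp[perm][:, group]   # shuffle all group columns jointly
        drops.append(np.mean((y - predict(Xp)) ** 2) - base)
    return float(np.mean(drops))

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 2 * X[:, 0] + X[:, 1]                   # column 2 is irrelevant
predict = lambda X: 2 * X[:, 0] + X[:, 1]   # "perfect" model for the sketch
imp_signal = grouped_permutation_importance(predict, X, y, group=[0, 1])
imp_noise = grouped_permutation_importance(predict, X, y, group=[2])
```

Shuffling rows jointly, rather than column by column, is what distinguishes grouped importance from single-feature importance when the group's features are correlated.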
In critical situations involving discrimination, gender inequality, economic damage, and even the possibility of casualties, machine learning models must be able to provide clear interpretations of their decisions. Otherwise, their obscure decision-making processes can lead to socioethical issues as they interfere with people’s lives. Random forest algorithms excel in the aforementioned sectors, where their ability to explain themselves is an obvious requirement. In this paper, we present LionForests, which builds on a preliminary work of ours. LionForests is a random forest-specific interpretation technique that provides rules as explanations. It applies to tasks ranging from binary to multi-class classification and regression, and a stable theoretical background supports it. A time and scalability analysis suggests that LionForests is much faster than our preliminary work and is also applicable to large datasets. Experimentation, including a comparison with state-of-the-art techniques, demonstrates the efficacy of our contribution. LionForests outperformed the other techniques in terms of precision, variance, and response time, but fell short in terms of rule length and coverage. Finally, we highlight conclusiveness, a unique property of LionForests that provides interpretation validity and distinguishes it from previous techniques.
Distribution of the relative runtimes on real datasets, normalized by the median, over 100 runs, of the runtime for the $\varepsilon$-AUS (which therefore corresponds to the 1.0 line; the absolute runtimes for this median are shown under the dataset name). The whiskers correspond to the minimum and maximum, the extremes of the box to the 1st and 3rd quartiles, and the line crossing the box to the median
Absolute runtimes on artificial datasets as a function of the total number m of itemsets. The median is over 100 runs and the shaded area spans the minimum to the maximum runtime. The EUSs scale as well as the $\varepsilon$-AUS does
Runtimes of SPEck on BIBLE for $\theta \in \{0.031, 0.043, 0.06, 0.1\}$ (see text for rationale)
We study the problem of efficiently mining statistically-significant sequential patterns from large datasets, under different null models. We consider one null model presented in the literature, and introduce two new ones that preserve different properties of the observed dataset. We describe SPEck, a generic framework for significant sequential pattern mining, that can be instantiated with any null model, when given a procedure for sampling datasets according to the null distribution. For the previously-proposed model, we introduce a novel procedure that samples exactly according to the null distribution, while existing procedures are approximate samplers. Our exact sampler is also more computationally efficient and much faster in practice. For the null models we introduce, we give exact and/or almost uniform samplers. Our experimental evaluation shows how exact samplers can be orders of magnitude faster than approximate ones, and scale well.
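The resampling scheme the framework is built around can be sketched generically: draw datasets from the null model, recount the pattern's support, and estimate a p-value with the standard (1 + #extreme)/(1 + #samples) estimator. The null sampler here is a trivial within-sequence shuffle of our own invention; SPEck's actual contribution, exact and fast samplers for specific null models, is not reproduced.

```python
import random

def empirical_pvalue(observed_support, sample_null_dataset, support,
                     n_samples=200, seed=42):
    # Monte-Carlo p-value of a pattern's support under a null model:
    # count how often a null dataset reaches the observed support.
    rng = random.Random(seed)
    extreme = sum(
        support(sample_null_dataset(rng)) >= observed_support
        for _ in range(n_samples)
    )
    return (1 + extreme) / (1 + n_samples)

data = ["ab", "ab", "ab", "ab"]
support = lambda d: sum("ab" in s for s in d)   # contiguous "ab" as a toy pattern
def sample_null(rng):
    # Toy null model: shuffle items within each sequence, preserving
    # sequence lengths and item multiplicities.
    return ["".join(rng.sample(s, len(s))) for s in data]

p = empirical_pvalue(support(data), sample_null, support)
# "ab" occurs in every sequence, which is unlikely under the shuffle null.
```

The quality of the conclusion rests entirely on the sampler matching the null distribution, which is why the paper's distinction between exact and approximate samplers matters.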
A vast and growing literature on explaining deep learning models has emerged. This paper contributes to that literature by introducing a global gradient-based model-agnostic method, which we call Marginal Attribution by Conditioning on Quantiles (MACQ). Our approach is based on analyzing the marginal attribution of predictions (outputs) to individual features (inputs). Specifically, we consider variable importance by fixing (global) output levels, and explaining how features marginally contribute to these fixed global output levels. MACQ can be seen as a marginal attribution counterpart to approaches such as accumulated local effects, which study the sensitivities of outputs by perturbing inputs. Furthermore, MACQ allows us to separate marginal attribution of individual features from interaction effects and to visualize the 3-way relationship between marginal attribution, output level, and feature value.
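One rough reading of the "fix output levels, then attribute" idea: bucket observations by the quantile of their prediction and average a first-order marginal attribution within each bucket. The details of MACQ's decomposition (interaction terms, the exact attribution formula) follow the paper, not this sketch; the function and its attribution grad_f(x) · x are our own simplification.

```python
import numpy as np

def macq_sketch(f, grad_f, X, n_quantiles=4):
    # Bucket observations by the quantile of their prediction, then
    # average the first-order marginal attribution grad_f(x) * x within
    # each bucket: attributions "conditioned on" fixed output levels.
    out = f(X)
    q = np.quantile(out, np.linspace(0, 1, n_quantiles + 1))
    bucket = np.clip(np.searchsorted(q, out, side="right") - 1,
                     0, n_quantiles - 1)
    attr = grad_f(X) * X
    return np.array([attr[bucket == b].mean(axis=0)
                     for b in range(n_quantiles)])

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
w = np.array([2.0, 0.0])                       # second feature is irrelevant
A = macq_sketch(lambda X: X @ w,
                lambda X: np.tile(w, (len(X), 1)), X)
# Attribution of the relevant feature rises with the output level.
```

This is the sense in which the method is a counterpart to accumulated local effects: instead of perturbing inputs and watching outputs, the output level is held fixed and inputs are asked how they contribute to it.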
Individual passenger travel patterns have significant value in understanding passengers' behavior, such as learning the hidden clusters of locations, time, and passengers. The learned clusters further enable commercially beneficial actions such as customized services, promotions, data-driven urban-use planning, peak hour discovery, and so on. However, individualized passenger modeling is very challenging for the following reasons: 1) individual passenger travel data are multi-dimensional spatiotemporal big data, including at least the origin, destination, and time dimensions; 2) individualized passenger travel patterns usually depend on the external environment, such as the distances and functions of locations, which is ignored in most current works. This work proposes a multi-clustering model to learn the latent clusters along the multiple dimensions of Origin, Destination, Time, and eventually, Passenger (ODT-P). We develop a graph-regularized tensor Latent Dirichlet Allocation (LDA) model by first extending the traditional LDA model into a tensor version and then applying it to individual travel data. The external information of stations is formulated as semantic graphs and incorporated as Laplacian regularizations. Furthermore, to improve the model scalability when dealing with massive data, an online stochastic learning method based on a tensorized variational Expectation-Maximization algorithm is developed. Finally, a case study based on passengers in the Hong Kong metro system is conducted and demonstrates better clustering performance than state-of-the-art methods, improving the point-wise mutual information index and algorithm convergence speed by a factor of two.
We propose Conditional Imputation GAN, an extended missing data imputation method based on Generative Adversarial Networks (GANs). The motivating use case is learning-to-rank, the cornerstone of modern search, recommendation system, and information retrieval applications. Empirical ranking datasets do not always follow standard Gaussian distributions or Missing Completely At Random (MCAR) mechanism, which are standard assumptions of classic missing data imputation methods. Our methodology provides a simple solution that offers compatible imputation guarantees while relaxing assumptions for missing mechanisms and sidesteps approximating intractable distributions to improve imputation quality. We prove that the optimal GAN imputation is achieved for Extended Missing At Random and Extended Always Missing At Random mechanisms, beyond the naive MCAR. Our method demonstrates the highest imputation quality on the open-source Microsoft Research Ranking Dataset and a synthetic ranking dataset compared to state-of-the-art benchmarks and across various feature distributions. Using a proprietary Amazon Search ranking dataset, we also demonstrate comparable ranking quality metrics for ranking models trained on GAN-imputed data compared to ground-truth data.
Plagiarism is a controversial and debated topic in different fields, especially in music, where the commercial market generates a huge amount of money. The lack of objective metrics to decide whether a song is plagiarized makes music plagiarism detection a very complex task: decisions often have to be based on subjective argumentation. Automated music analysis methods that identify music similarities can be of help. In this work, we first propose two novel such methods: a text similarity-based method and a clustering-based method. Then, we show how to combine them to get an improved (hybrid) method. The result is a novel adaptive meta-heuristic for music plagiarism detection. To assess the effectiveness of the proposed methods, considered both singularly and in the combined meta-heuristic, we performed tests on a large dataset of ascertained plagiarism and non-plagiarism cases. Results show that the meta-heuristic outperforms existing methods. Finally, we deployed the meta-heuristic into a tool, accessible as a Web application, and assessed the effectiveness, usefulness, and overall user acceptance of the tool by means of a study involving 20 people, divided into two groups, one of which had access to the tool. The study consisted in having people decide which pairs of songs, in a predefined set of pairs, should be considered plagiarism and which should not. The study shows that the group supported by our tool successfully identified all plagiarism cases, performing all tasks with no errors. The whole sample agreed about the usefulness of an automatic tool that provides a measure of similarity between two songs.
The representation of objects is crucial for the learning process, often having a large impact on the application performance. The dissimilarity space (DS) is one such representation, built by applying a dissimilarity measure between objects (e.g., Euclidean distance). However, other measures can be applied to generate more informative data representations. This paper focuses on the application of second-order dissimilarity measures, namely the Shared Nearest Neighbor (SNN) and the Dissimilarity Increments (Dinc), to produce new DSs that lead to a better description of the data, by reducing the overlap of the classes and by increasing the discriminative power of features. Experimental results show that the application of the proposed DSs provides significant benefits for unsupervised learning tasks. When compared with the Feature and Euclidean spaces, the proposed SNN and Dinc spaces improve the performance of traditional hierarchical clustering algorithms, and also help in the visualization task, leading to higher areas under the precision/recall curve.
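The SNN measure mentioned above is second-order in the sense that it compares neighbourhoods rather than raw coordinates: two objects are similar when their k-nearest-neighbour lists overlap. A minimal sketch, with Euclidean distances as the first-order measure and all names our own:

```python
import numpy as np

def snn_dissimilarity(X, k=5):
    # Shared Nearest Neighbor dissimilarity: 1 minus the normalised
    # overlap of two objects' k-NN lists (computed here from Euclidean
    # distances). A sketch of the second-order idea, not the paper's code.
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self from neighbour lists
    neigh = [set(row) for row in np.argsort(d, axis=1)[:, :k]]
    n = len(X)
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = 1.0 - len(neigh[i] & neigh[j]) / k
    return out

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(5, 0.1, (10, 2))])
D = snn_dissimilarity(X, k=5)
within, cross = D[:10, :10].mean(), D[:10, 10:].mean()
# Objects in different blobs share no neighbours, so cross-blob
# dissimilarity saturates at 1 while within-blob values stay lower.
```

Because the measure depends only on neighbourhood membership, it is insensitive to the absolute scale of distances, which is part of why such spaces can reduce class overlap.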
Given an m-by-n real matrix, biclustering aims to discover relevant submatrices. This article defines a new type of bicluster. In any of its columns, the values in the rows of the bicluster must all be strictly greater than those in the rows absent from it; hence the discovery of a binary clustering of the rows in the restricted context of the columns of the bicluster. To keep only the best bicluster among those carrying redundant information, its rows must not be a subset or a superset of the rows of another bicluster of greater or equal quality. Any computable function can be chosen to assign qualities to the biclusters; in that respect, the proposed definition is generic. Dynamic programming and appropriate data structures allow the biclusters satisfying it to be exhaustively listed within $O(m^2n + mn^2)$ time, plus the time to compute $O(mn)$ qualities. After some adaptations, the proposed algorithm, Biceps, remains subquadratic if its complexity is expressed as a function of $m_{\text{non-min}} \cdot n$, where $m_{\text{non-min}}$ is the maximal number of non-minimal values in a column, i.e., for sparse matrices. Experiments on three real-world datasets demonstrate the effectiveness of the proposal in different application contexts. They also show that its good theoretical efficiency holds in practice: two minutes and 5.3 GB of RAM are enough to list the desired biclusters in a dense 801-by-20,531 matrix; 3.5 s and 192 MB of RAM for a sparse 631,532-by-174,559 matrix with 2,575,425 nonzero values.
Genetic programming (GP), a widely used Evolutionary Computing technique, suffers from bloat -- the problem of excessive growth in individuals' sizes. As a result, its ability to efficiently explore complex search spaces is reduced, and the resulting solutions are less robust and generalisable. Moreover, it is difficult to understand and explain models that contain bloat. This phenomenon is well researched, primarily from the angle of controlling bloat; our focus in this paper is instead to review the literature from an explainability point of view, by looking at how simplification can make GP models more explainable by reducing their sizes. Simplification is a code editing technique whose primary purpose is to make GP models more explainable. However, it can offer bloat control as an additional benefit when implemented and applied with caution. Researchers have proposed several simplification techniques and adopted various strategies to implement them. We organise the literature along multiple axes to identify the relative strengths and weaknesses of simplification techniques and to identify emerging trends and areas for future exploration. We highlight design and integration challenges and propose several avenues for research. One of them is to consider simplification as a standalone operator, rather than an extension of the standard crossover or mutation operators. Its role is then more clearly complementary to other GP operators, and it can be integrated as an optional feature into an existing GP setup. Another proposed avenue is to explore the lack of utilisation of complexity measures in simplification. So far, size is the most discussed measure, with only two pieces of prior work pointing out the benefits of using time as a measure when controlling bloat.
Given the common problem of missing data in real-world applications from various fields, such as remote sensing, ecology and meteorology, the interpolation of missing spatial and spatio-temporal data can be of tremendous value. Existing methods for spatial interpolation, most notably Gaussian processes and spatial autoregressive models, tend to suffer from (a) a trade-off between modelling local or global spatial interaction, (b) the assumption that there is only one possible path between two points, and (c) the assumption of homogeneity of intermediate locations between points. Addressing these issues, we propose a value propagation-based spatial interpolation method called VPint, inspired by Markov reward processes (MRPs), and introduce two variants thereof: (i) a static discount (SD-MRP) and (ii) a data-driven weight prediction (WP-MRP) variant. Both these interpolation variants operate locally, while implicitly accounting for global spatial relationships in the entire system through recursion. We evaluated our proposed methods by comparing the mean absolute error, root mean squared error, peak signal-to-noise ratio and structural similarity of interpolated grid cells to those of 8 common baselines. Our analysis involved detailed experiments on a synthetic and two real-world datasets, as well as experiments on convergence and scalability. Empirical results demonstrate the competitive advantage of VPint on randomly missing data, where it performed better than baselines in terms of mean absolute error and structural similarity, as well as spatially clustered missing data, where it performed best on 2 out of 3 datasets.
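The value-propagation idea behind the static-discount variant can be sketched as value iteration on a grid: observed cells keep their values, missing cells repeatedly take the discounted maximum of their 4-neighbours, so information flows along many paths at once (addressing issue (b) above). This is one plausible reading of SD-MRP under our own assumptions (uniform discount, max aggregation), not the paper's implementation.

```python
import numpy as np

def sd_mrp_interpolate(grid, gamma=0.6, n_iter=100):
    # Static-discount value propagation on a 2-D grid: observed cells
    # are fixed; NaN cells iterate v = gamma * max(4-neighbour values).
    v = np.where(np.isnan(grid), 0.0, grid)
    known = ~np.isnan(grid)
    for _ in range(n_iter):
        padded = np.pad(v, 1, constant_values=0.0)
        neigh = np.stack([padded[:-2, 1:-1], padded[2:, 1:-1],
                          padded[1:-1, :-2], padded[1:-1, 2:]])
        v = np.where(known, grid, gamma * neigh.max(axis=0))
    return v

grid = np.full((5, 5), np.nan)
grid[0, 0] = 1.0                      # a single observed cell
filled = sd_mrp_interpolate(grid)
# Interpolated values decay geometrically with distance from the
# observation: filled[0, 1] == gamma, filled[0, 4] == gamma**4.
```

The recursion is what makes the method local in computation yet global in effect: every missing cell ends up influenced by every observation reachable through the grid.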
We propose MultiRocket, a fast time series classification (TSC) algorithm that achieves state-of-the-art accuracy with a tiny fraction of the time and without the complex ensembling structure of many state-of-the-art methods. MultiRocket improves on MiniRocket, one of the fastest TSC algorithms to date, by adding multiple pooling operators and transformations to improve the diversity of the features generated. In addition to processing the raw input series, MultiRocket also applies first order differences to transform the original series. Convolutions are applied to both representations, and four pooling operators are applied to the convolution outputs. When benchmarked using the University of California Riverside TSC benchmark datasets, MultiRocket is significantly more accurate than MiniRocket, and competitive with the best ranked current method in terms of accuracy, HIVE-COTE 2.0, while being orders of magnitude faster.
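The two ingredients the abstract names, a first-order difference representation alongside the raw series, and pooling of convolution outputs, can be sketched with random kernels and PPV (proportion of positive values) pooling. The real method uses dilated kernels and three further pooling operators; this minimal version and its names are our own.

```python
import numpy as np

def ppv_features(series, kernels):
    # MultiRocket-style feature sketch: convolve both the raw series and
    # its first-order differences with each kernel, then pool each
    # convolution output with PPV (proportion of positive values).
    feats = []
    for rep in (series, np.diff(series)):   # two representations
        for kern in kernels:
            conv = np.convolve(rep, kern, mode="valid")
            feats.append((conv > 0).mean())  # PPV pooling
    return np.array(feats)

rng = np.random.default_rng(0)
kernels = [rng.choice([-1.0, 2.0], size=9) for _ in range(4)]
x = np.sin(np.linspace(0, 20, 300))
f = ppv_features(x, kernels)             # 2 representations x 4 kernels
```

With thousands of kernels this transform is cheap and embarrassingly parallel, which is the source of the speed advantage claimed over ensemble-based TSC methods; a linear classifier is then trained on the pooled features.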
Datasets with imbalanced class distributions arise in various real-world applications. A great number of approaches have been proposed to address the class imbalance challenge, but most of these models perform poorly when datasets are characterized by high class imbalance, class overlap and low data quality. In this study, we propose an effective meta-framework for highly imbalanced, overlapped classification, called DAPS (DynAmic self-Paced sampling enSemble), which (1) leverages reasonable and effective sampling to maximize the utilization of informative instances and to avoid serious information loss and (2) assigns proper instance weights to address the issues of noisy data. Furthermore, most existing canonical classifiers (e.g. Decision Tree, Random Forest) can be integrated in DAPS. Comprehensive experimental results on both synthetic and three real-world datasets show that the DAPS model obtains considerable improvements in F1-score when compared to a broad range of published models.
Examples of ROC graphs. A ROC curve (a) can be obtained from a scoring classifier. The dotted line in (a) corresponds to a classifier with performance comparable to that of a completely random classifier. In (b), the Area Under the Curve (AUC) of the corresponding curve is highlighted
Illustrative example of the Area Under the Curve for Clustering (AUCC) procedure: (a) a toy dataset with an arbitrary clustering solution, in which clusters are indicated by a combination of colors and shapes (red diamonds / black circles); (b) the similarity matrix between the data objects of the dataset; (c) objects are considered in a pairwise fashion and each pair is associated with the corresponding similarity value and cluster assignment (1 if the pair belongs to the same cluster, 0 otherwise). These pairwise representations can be provided as input to a standard ROC analysis procedure, resulting in an AUC of 0.9167. This is the AUCC assessment of candidate solution (a)
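The pairwise construction described in this caption fits in a few lines of stdlib Python. This is an illustration of the procedure, not the authors' code; `sim` is any caller-supplied similarity function:

```python
from itertools import combinations

def aucc(points, labels, sim):
    """AUCC: treat every pair of objects as a 'test instance' scored
    by its similarity, with positive class 'same cluster', and
    compute the AUC of the resulting ROC analysis."""
    pairs = [(sim(points[i], points[j]), labels[i] == labels[j])
             for i, j in combinations(range(len(points)), 2)]
    pos = [s for s, same in pairs if same]
    neg = [s for s, same in pairs if not same]
    # AUC = P(pos pair scores higher than neg pair), ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For a clustering where every within-cluster pair is more similar than every between-cluster pair, the AUCC is exactly 1.0.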
Evaluation of random partitions of 108 synthetic datasets with varied characteristics. Each orange line corresponds to the average of 100 randomly generated partitions evaluated on a given dataset (108 lines/datasets per plot). The number of clusters in the randomly generated partitions is depicted on the x-axis. Results are stratified by the cluster size distribution of the generated partitions (plot columns). The red line shows the overall mean
Clustering results and ROC curves. From (a) to (d) we show clustering solutions with k = 2, 3, 5 and 6 for the Ruspini dataset, with (e) their corresponding ROC curves. From (f) to (i) we show clustering solutions with k = 7, 8, 10 and 11 for a simulated dataset, with (j) their corresponding ROC curves. Partitions were generated with k-means. We zoom in on regions at the bottom-left corner to highlight the behaviour of FPR and TPR for solutions under- vs over-estimating the number of clusters for small values of the distance threshold. ROC curves for the (omitted) partitions with the optimal number of clusters are depicted in black in both (e) and (j)
From (a) to (f) we depict six datasets with two clusters (200 objects per cluster) obtained from normal distributions centered at (0, 0) and (40, 40). For each dataset, cluster variances are the same for both clusters and each of their coordinates, fixed at 25, 100, 150, 300, 400, and 500, from (a) to (f). In (g) we depict the ROC curves for the ground-truth partition of each dataset. Their respective AUC values are 1.00, 0.9923144, 0.9739683, 0.8554201, 0.8129828, and 0.7548744
The area under the receiver operating characteristics (ROC) curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically, in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we elaborate on the fact that, in the context of internal/relative clustering validation as considered here, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC actually unveils a much more efficient algorithmic procedure. Our theoretical findings are supported by experimental results. These results show that, in addition to the effective and robust quantitative evaluation provided by AUCC, visual inspection of the ROC curves themselves can be useful for further assessing a candidate clustering solution from a broader, qualitative perspective as well.
Results of using the MAR problem formulation for making a playlist of items. The goal is to maximize the number of activated users. The universe V includes songs, movies or books. A user (a 0–1 activation function f_i) is activated if they like at least one item among all items they consume within their budget. Markers are jittered horizontally to avoid overlap
MSR for multiple-intents re-ranking in web page ranking. The goal is to maximize the total utility of all user intents within their individual reading budgets. The universe V includes documents. The utility of a user intent (a coverage function f_i) is represented by the coverage rate of its top keywords. Markers are jittered horizontally to avoid overlap
MSR for sequential data subset selection for kNN models. The goal is to boost the average predictive accuracy of the kNN models. The universe V includes all data points. The sum of the surrogate objective functions f_i (reduction of radii) over all models is optimized. Markers are jittered horizontally to avoid overlap
Running time of all methods for the task of making a synthetic playlist
Submodular maximization has been the backbone of many important machine-learning problems, and has applications to viral marketing, diversification, sensor placement, and more. However, the study of maximizing submodular functions has mainly been restricted to the context of selecting a set of items. On the other hand, many real-world applications require a solution that is a ranking over a set of items. The problem of ranking in the context of submodular function maximization has been considered before, but to a much lesser extent than item-selection formulations. In this paper, we explore a novel formulation for ranking items with submodular valuations and budget constraints. We refer to this problem as max-submodular ranking (MSR). In more detail, given a set of items and a set of non-decreasing submodular functions, where each function is associated with a budget, we aim to find a ranking of the set of items that maximizes the sum of values achieved by all functions under the budget constraints. For the MSR problem with cardinality- and knapsack-type budget constraints we propose practical algorithms with approximation guarantees. In addition, we perform an empirical evaluation, which demonstrates the superior performance of the proposed algorithms against strong baselines.
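A natural greedy heuristic for this objective is easy to sketch: at each rank position, append the item with the largest total marginal gain over all functions whose budget has not yet run out. This is an illustrative heuristic with coverage functions, not the paper's approximation algorithm:

```python
def greedy_msr(items, gains, budgets):
    """Greedy ranking for a toy MSR instance.
    gains[i][item]: set of elements that item covers for monotone
    coverage function f_i; budgets[i]: how many ranked items f_i can
    consume.  Returns a full ranking of the items."""
    ranking = []
    covered = [set() for _ in budgets]
    remaining = list(items)
    while remaining:
        pos = len(ranking)  # next rank position to fill
        def total_gain(item):
            # Marginal coverage gain summed over still-active functions.
            return sum(len(gains[i].get(item, set()) - covered[i])
                       for i, b in enumerate(budgets) if pos < b)
        best = max(remaining, key=total_gain)
        ranking.append(best)
        remaining.remove(best)
        for i, b in enumerate(budgets):
            if pos < b:
                covered[i] |= gains[i].get(best, set())
    return ranking
```

With two functions of budgets 1 and 2, the greedy choice at the first position trades off both functions, then only the longer-budget function matters.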
A large number of covariates can have a negative impact on the quality of causal effect estimation, since confounding adjustment becomes unreliable when the number of covariates is large relative to the number of samples. The propensity score is a common way to deal with a large covariate set, but the accuracy of propensity score estimation (normally done by logistic regression) is also challenged by the large number of covariates. In this paper, we prove that a large covariate set can be reduced to a lower-dimensional representation which captures the complete information for adjustment in causal effect estimation. This theoretical result enables effective data-driven algorithms for causal effect estimation. Supported by the result, we develop an algorithm that employs a supervised kernel dimension reduction method to learn a lower-dimensional representation from the original covariate space, and then utilises nearest neighbour matching in the reduced covariate space to impute the counterfactual outcomes, avoiding the problems of a large covariate set. The proposed algorithm is evaluated on two semi-synthetic and three real-world datasets and the results show the effectiveness of the proposed algorithm.
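The matching step can be sketched in one dimension, assuming the dimension reduction has already been applied. This is a standard nearest-neighbour matching estimator shown for illustration, not the authors' algorithm:

```python
def matched_ate(x_treated, y_treated, x_control, y_control):
    """Estimate the average treatment effect by imputing each unit's
    counterfactual outcome from its nearest neighbour in the other
    group, here in a 1-D reduced covariate space."""
    def nearest(x, xs, ys):
        i = min(range(len(xs)), key=lambda k: abs(xs[k] - x))
        return ys[i]
    # Treated units: observed outcome minus imputed control outcome.
    effects = [y - nearest(x, x_control, y_control)
               for x, y in zip(x_treated, y_treated)]
    # Control units: imputed treated outcome minus observed outcome.
    effects += [nearest(x, x_treated, y_treated) - y
                for x, y in zip(x_control, y_control)]
    return sum(effects) / len(effects)
```

When treated and control units match exactly on the reduced covariate and the outcomes differ by a constant, the estimator recovers that constant.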
Artificial intelligence (AI) has achieved notable performance in many fields and its research impact in healthcare has been unquestionable. Nevertheless, the deployment of such computational models in clinical practice is still limited. Some of the major issues recognized as barriers to successful real-world machine learning applications include lack of transparency, reliability and personalization. These aspects are decisive not only for patient safety, but also to assure the confidence of professionals. Explainable AI aims to address AI transparency and reliability concerns, with the capacity to better understand and trust a model, providing the ability to justify its outcomes and thus effectively assisting clinicians in rationalizing the model prediction. This work proposes an innovative machine learning based approach, implementing a hybrid scheme able to combine knowledge-driven and data-driven techniques in a systematic way. In a first step, a global set of interpretable rules is generated, founded on clinical evidence. Then, in a second phase, a machine learning model is trained to select, from the global set of rules, the subset that is most appropriate for a given patient, according to their particular characteristics. This approach simultaneously addresses three of the central requirements of explainable AI (interpretability, personalization, and reliability) without impairing the accuracy of the model's prediction. The scheme was validated with a real dataset provided by two Portuguese hospitals, the Santa Cruz Hospital, Lisbon, and the Santo André Hospital, Leiria, comprising a total of N = 1111 patients who suffered an acute coronary syndrome event, for whom 30-day mortality was assessed. When compared with standard black-box structures (e.g. a feedforward neural network), the proposed scheme achieves similar performance, while simultaneously ensuring clinical interpretability and personalization of the model, as well as providing a level of reliability for the estimated mortality risk.
The statistical comparison of machine learning classifiers is frequently underpinned by null hypothesis significance testing. Here, we provide a survey and analysis of underrated problems that significance testing entails for classification benchmark studies. The p-value has become deeply entrenched in machine learning, but it is substantially less objective and less informative than commonly assumed. Even very small p-values can drastically overstate the evidence against the null hypothesis. Moreover, the p-value depends on the experimenter’s intentions, irrespective of whether these were actually realized or not. We show how such intentions can lead to experimental designs with more than one stage, and how to calculate a valid p-value for such designs. We discuss two widely used statistical tests for the comparison of classifiers, the Friedman test and the Wilcoxon signed rank test. Some improvements to the use of p-values, such as the calibration with the Bayes factor bound, and alternative methods for the evaluation of benchmark studies are discussed as well.
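One calibration mentioned here, the Bayes factor bound of Sellke, Bayarri and Berger, is simple enough to compute directly. The sketch below shows the standard formula (for 0 < p < 1/e, any Bayes factor in favour of the null is at least -e·p·ln(p)) and the implied lower bound on the posterior probability of the null:

```python
import math

def bayes_factor_bound(p):
    """Sellke-Bayarri-Berger calibration: for 0 < p < 1/e, the Bayes
    factor in favour of the null is at least -e * p * ln(p), i.e. the
    evidence against the null is far weaker than the raw p-value
    suggests."""
    if not 0 < p < 1 / math.e:
        raise ValueError("calibration applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)

def min_posterior_null(p, prior_null=0.5):
    """Lower bound on the posterior probability of the null implied
    by the Bayes factor bound, given a prior probability of the null."""
    b = bayes_factor_bound(p) * prior_null / (1 - prior_null)
    return b / (1 + b)
```

For example, a "significant" p = 0.05 still leaves the null with posterior probability of at least about 0.29 under equal priors.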
Being able to capture the characteristics of a time series with a feature vector is a very important task with a multitude of applications, such as classification, clustering or forecasting. Usually, the features are obtained from linear and nonlinear time series measures that may present several data-related drawbacks. In this work we introduce NetF as an alternative set of features, incorporating several representative topological measures of different complex network mappings of the time series. Our approach does not require data preprocessing and is applicable regardless of any data characteristics. Exploring our novel feature vector, we are able to connect mapped network features to properties inherent in diversified time series models, showing that NetF can be useful to characterize time series data. Furthermore, we also demonstrate the applicability of our methodology in clustering synthetic and benchmark time series sets, comparing its performance with more conventional features, showcasing how NetF can achieve high-accuracy clusters. Our results are very promising, with network features from different mapping methods capturing different properties of the time series, adding a different and rich feature set to the literature.
Efficient and interpretable classification of time series is an essential data mining task with many real-world applications. Recently, several dictionary- and shapelet-based time series classification methods have been proposed that employ contiguous subsequences of fixed length. We extend pattern mining to efficiently enumerate long variable-length sequential patterns with gaps. Additionally, we discover patterns at multiple resolutions, thereby combining cohesive sequential patterns that vary in length, duration and resolution. For time series classification we construct an embedding based on sequential pattern occurrences and learn a linear model. The discovered patterns form the basis for interpretable insight into each class of time series. The pattern-based embedding for time series classification (PETSC) supports both univariate and multivariate time series datasets of varying length subject to noise or missing data. We experimentally validate that MR-PETSC performs significantly better than baseline interpretable methods such as DTW, BOP and SAX-VSM on univariate and multivariate time series. On univariate time series, our method performs comparably to many recent methods, including BOSS, cBOSS, S-BOSS, ProximityForest and ResNET, and is only narrowly outperformed by state-of-the-art methods such as HIVE-COTE, ROCKET, TS-CHIEF and InceptionTime. Moreover, on multivariate datasets PETSC performs comparably to the current state of the art such as HIVE-COTE, ROCKET, CIF and ResNET, none of which are interpretable. PETSC scales to large datasets, and the total time for training and making predictions on all 85 'bake off' datasets in the UCR archive is under 3 h, making it one of the fastest methods available.
PETSC is particularly useful as it learns a linear model in which each feature represents a sequential pattern in the time domain, supporting human oversight to ensure predictions are trustworthy and fair, which is essential in financial, medical or bioinformatics applications.
The paper presents a computational approach to the Availability of soccer players. Availability is defined as the probability that a pass reaches the target player without being intercepted by opponents. Clearly, a computational model for this probability builds on models for ball dynamics, player movements, and the technical skills of the pass giver. Our approach aggregates these quantities over all possible passes to the target player to compute a single Availability value. Empirically, our approach outperforms state-of-the-art competitors using data from 58 professional soccer matches. Moreover, our experiments indicate that the model can even outperform soccer coaches in assessing the availability of soccer players from static images.
We present XEM, an eXplainable-by-design Ensemble method for Multivariate time series classification. XEM relies on a new hybrid ensemble method that combines an explicit boosting-bagging approach to handle the bias-variance trade-off faced by machine learning models and an implicit divide-and-conquer approach to individualize classifier errors on different parts of the training data. Our evaluation shows that XEM outperforms the state-of-the-art MTS classifiers on the public UEA datasets. Furthermore, XEM provides faithful explainability-by-design and manifests robust performance when faced with challenges arising from continuous data collection (different MTS length, missing data and noise).
An illustration of the data normalisation step, and the rationale for using the inverse cosine similarity as the dissimilarity measure. Left: The original data points. Right: The data points that have been normalised to lie on the unit sphere
A geometric illustration of the necessity for stretching points in X
Median running times (in log-scale of seconds) of different algorithms on the MNIST handwritten digits data
Performance of subspace clustering methods on Hopkins155 database
Clustering accuracy of constrained clustering methods on the selected Hopkins155 datasets with respect to varying proportions of labelled data. Queried points are selected through the active learning strategy of Peng and Pavlidis (2019)
Spectral-based subspace clustering methods have proved successful in many challenging applications such as gene sequencing, image recognition, and motion segmentation. In this work, we first propose a novel spectral-based subspace clustering algorithm that seeks to represent each point as a sparse convex combination of a few nearby points. We then extend the algorithm to a constrained clustering and active learning framework. Our motivation for developing such a framework stems from the fact that typically either a small amount of labelled data is available in advance, or it is possible to label some points at a cost. The latter scenario is typically encountered in the process of validating a cluster assignment. Extensive experiments on simulated and real datasets show that the proposed approach is effective and competitive with state-of-the-art methods.
Explanations for an instance x of the german dataset classified as a denied loan by a DNN. In red, non-actionable changes (Color figure online)
Aggregated metrics for explainers implemented by existing libraries, varying the required number of counterfactuals k. Best viewed in color (Color figure online)
Aggregated metrics for explainers not based on optimization strategies, varying the number of counterfactuals k. Best viewed in color (Color figure online)
Aggregated metrics for explainers not based on optimization strategies, varying the number of required counterfactuals k and the distance function adopted. Best viewed in color (Color figure online)
Interpretable machine learning aims at unveiling the reasons behind predictions returned by uninterpretable classifiers. One of the most valuable types of explanation is the counterfactual. A counterfactual explanation reveals what should have been different in an instance to observe a different outcome. For instance, a bank customer asks for a loan that is rejected. The counterfactual explanation consists of what should have been different for the customer in order to have the loan accepted. Recently, there has been an explosion of proposals for counterfactual explainers. The aim of this work is to survey the most recent explainers returning counterfactual explanations. We categorize explainers based on the approach adopted to return the counterfactuals, and we label them according to characteristics of the method and properties of the counterfactuals returned. In addition, we visually compare the explanations, and we report quantitative benchmarking assessing minimality, actionability, stability, diversity, discriminative power, and running time. The results make evident that the current state of the art does not provide a counterfactual explainer able to guarantee all these properties simultaneously.
Time series data remains a perennially important data type in data mining. In the last decade there has been an increasing realization that time series data can be best understood by reasoning about time series subsequences on the basis of their similarity to other subsequences, the two most familiar such concepts being motifs and discords. Time series motifs refer to two particularly close subsequences, whereas time series discords indicate subsequences that are far from their nearest neighbors. However, we argue that it can sometimes be useful to simultaneously reason about a subsequence's closeness to certain data and its distance to other data. In this work we introduce a novel primitive called the Contrast Profile that allows us to efficiently compute such a definition in a principled way. As we will show, the Contrast Profile has many downstream uses, including anomaly detection, data exploration, and preprocessing unstructured data for classification. We demonstrate the utility of the Contrast Profile by showing how it allows end-to-end classification in datasets with tens of billions of datapoints, and how it can be used to explore datasets and reveal subtle patterns that might otherwise escape our attention. Moreover, we demonstrate the generality of the Contrast Profile by presenting detailed case studies in domains as diverse as seismology, animal behavior, and cardiology.
The probability density of p_a for each action in environment two (left) and environment three (right)
The average reward obtained with the Bernstein bound (top row) and with the clipping bound (upper middle row) in tasks sampled from environment one. The values of the Bernstein bound (lower middle row) and the clipping bound (bottom row) are also shown. The solid lines show the mean average reward and the shaded regions show the mean ± 1 standard deviation
The average reward obtained with the Bernstein bound (top row) and with the clipping bound (upper middle row) in tasks sampled from environment two. The values of the Bernstein bound (lower middle row) and the clipping bound (bottom row) are also shown. The solid lines show the mean average reward and the shaded regions show the mean ± 1 standard deviation
The average reward obtained with the Bernstein bound (top row) and with the clipping bound (upper middle row) in tasks sampled from environment three. The values of the Bernstein bound (lower middle row) and the clipping bound (bottom row) are also shown. The solid lines show the mean average reward and the shaded regions show the mean ± 1 standard deviation
The average reward obtained with the Bernstein bound (top row) and with the clipping bound (upper middle row) in tasks sampled from environment three, starting from a more informative hyperprior. The values of the Bernstein bound (lower middle row) and the clipping bound (bottom row) are also shown. The solid lines show the mean average reward and the shaded regions show the mean ± 1 standard deviation
We present a PAC-Bayesian analysis of lifelong learning. In the lifelong learning problem, a sequence of learning tasks is observed one at a time, and the goal is to transfer information acquired from previous tasks to new learning tasks. We consider the case where each learning task is a multi-armed bandit problem. We derive lower bounds on the expected average reward that would be obtained if a given multi-armed bandit algorithm were run in a new task with a particular prior and for a set number of steps. We propose lifelong learning algorithms that use our new bounds as learning objectives. Our proposed algorithms are evaluated in several lifelong multi-armed bandit problems and are found to perform better than a baseline method that does not use generalisation bounds.
Real-world datasets are often characterised by outliers: data items that do not follow the same structure as the rest of the data. These outliers might negatively influence modelling of the data. In data analysis it is therefore important to consider methods that are robust to outliers. In this paper we develop a robust regression method that finds the largest subset of data items that can be approximated using a sparse linear model to a given precision. We show that this can yield the best possible robustness to outliers. However, this problem is NP-hard, and to solve it we present an efficient approximation algorithm, termed SLISE. Our method extends existing state-of-the-art robust regression methods, especially in terms of speed on high-dimensional datasets. We demonstrate our method by applying it to both synthetic and real-world regression problems.
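The objective described here (the largest subset approximable by a linear model to a given precision) can be made concrete with a tiny brute-force stand-in. SLISE itself uses an efficient approximation algorithm; the sketch below only scores candidate models by subset size to show why the criterion is robust to outliers:

```python
def subset_size(model, data, eps):
    """Number of (x, y) items a 1-D linear model y = a*x + b fits to
    within eps -- the quantity a SLISE-style method maximises."""
    a, b = model
    return sum(abs(y - (a * x + b)) <= eps for x, y in data)

def best_of(models, data, eps):
    """Pick, among candidate models, the one fitting the largest
    subset (brute force; illustrative only)."""
    return max(models, key=lambda m: subset_size(m, data, eps))
```

A gross outlier simply drops out of the fitted subset instead of dragging the model towards itself, which is the source of the robustness.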
Complex systems, abstractly represented as networks, are ubiquitous in everyday life. Analyzing and understanding these systems requires, among others, tools for community detection. As no single best community detection algorithm can exist, robustness across a wide variety of problem settings is desirable. In this work, we present Synwalk, a random walk-based community detection method. Synwalk builds upon a solid theoretical basis and detects communities by synthesizing the random walk induced by the given network from a class of candidate random walks. We thoroughly validate the effectiveness of our approach on synthetic and empirical networks, and compare Synwalk's performance with that of Infomap and Walktrap (also random walk-based), Louvain (based on modularity maximization) and stochastic block model inference. Our results indicate that Synwalk performs robustly on networks with varying mixing parameters and degree distributions. We outperform Infomap on networks with high mixing parameter, and Infomap and Walktrap on networks with many small communities and low average degree. Our work has the potential to inspire further development of community detection via synthesis of random walks, and we provide concrete ideas for future research.
The increasing value of data held in enterprises makes it an attractive target to attackers. The increasing likelihood and impact of a cyber attack have highlighted the importance of effective cyber risk estimation. We propose two methods for modelling Value-at-Risk (VaR) which can be used for any time-series data. The first approach is based on Quantile Autoregression (QAR), which can estimate VaR for different quantiles, i.e. confidence levels. The second method, which we term Competitive Quantile Autoregression (CQAR), dynamically re-estimates cyber risk as soon as new data becomes available. This method provides a theoretical guarantee that it asymptotically performs as well as any QAR at any time point in the future. We show that these methods can predict the size and inter-arrival time of cyber hacking breaches by running coverage tests. The proposed approaches allow modelling a separate stochastic process for each significance level and therefore provide more flexibility compared to previously proposed techniques. We provide fully reproducible code for conducting the experiments.
In this work, we present a dimensionality reduction (also known as sketching) algorithm for categorical datasets. Our proposed sketching algorithm Cabin constructs low-dimensional binary sketches from high-dimensional categorical vectors, and our distance estimation algorithm Cham computes a close approximation of the Hamming distance between any two original vectors only from their sketches. The minimum sketch dimension required by Cham to ensure a good estimation theoretically depends only on the sparsity of the data points, making it useful for many real-life scenarios involving sparse datasets. We present a rigorous theoretical analysis of our approach and supplement it with extensive experiments on several high-dimensional real-world data sets, including one with over a million dimensions. We show that the Cabin and Cham duo is a significantly fast and accurate approach for tasks such as RMSE, all-pair similarity, and clustering when compared to working with the full dataset and other dimensionality reduction techniques.
On a randomized stream generated from the data schema and distribution shown above in (a), EFDT first splits after 8–9 examples, then splits again after 12–13 examples to build the correct tree. VFDT takes around 69 examples for the first split, and makes the second split at around 73 examples. EFDT greatly increases statistical efficiency without compromising the use of a rigorous statistical test to determine split attributes, and revises splits in order to converge to the ideal batch tree.
Decision tree ensembles are widely used in practice. In this work, we study in ensemble settings the effectiveness of replacing the split strategy of the state-of-the-art online tree learner, Hoeffding Tree, with a rigorous but more eager splitting strategy that we had previously published as Hoeffding AnyTime Tree. Hoeffding AnyTime Tree (HATT) uses the Hoeffding test to determine whether the current best candidate split is superior to the current split, with the possibility of revision, while Hoeffding Tree aims to determine whether the top candidate is better than the second best and, once a split is selected, fixes it for all posterity. HATT converges to the ideal batch tree while Hoeffding Tree does not. We find that HATT is an efficacious base learner for online bagging and online boosting ensembles. On UCI and synthetic streams, HATT as a base learner outperforms Hoeffding Tree at a 0.05 significance level for the majority of tested ensembles on what we believe is the largest and most comprehensive set of testbenches in the online learning literature. Our results indicate that HATT is a superior alternative to Hoeffding Tree in a large number of ensemble settings.
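The difference between the two split strategies comes down to what the Hoeffding bound is compared against. The sketch below shows the standard bound and simplified decision rules for both learners (the real implementations also handle ties and grace periods):

```python
import math

def hoeffding_epsilon(value_range, delta, n):
    """Hoeffding bound: with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of
    n observations."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def ht_splits(gain_best, gain_second, eps):
    """Hoeffding Tree: split once the best candidate beats the
    second-best candidate by more than epsilon, then fix the split."""
    return gain_best - gain_second > eps

def hatt_splits(gain_best, gain_current, eps):
    """HATT: split (and possibly later revise) once the best candidate
    beats the *current* split by more than epsilon."""
    return gain_best - gain_current > eps
```

Comparing against the current split rather than the runner-up is what makes HATT more eager: when the current split is clearly poor, HATT acts even while the top two candidates are still statistically indistinguishable.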
Node classification example. Two masterpieces of Michelangelo need to be classified as either Fresco or Sculpture based on their linked information. The "type" relations are, for obvious reasons, not taken into account during this evaluation
Three different techniques to perform KG node classification: a the INK pipeline discussed in this paper, b the RDF2Vec-generated embeddings, and c the rather black-box graph convolutional neural networks
Example of how a Decision Tree classifier can be used in combination with the INK features
The average time measurements over 5 runs for all 7 benchmark datasets, for the best-performing techniques and parameters as reported in Table 4. The time measurements are visualised as a function of depth
The average memory usage in gigabytes over 5 runs for all 7 benchmark datasets, for the best-performing techniques and parameters as reported in Table 4. The memory usage is visualised as a function of depth
Deep learning techniques are increasingly being applied to solve various machine learning tasks that use Knowledge Graphs as input data. However, these techniques typically learn a latent representation for the entities of interest internally, which is then used to make decisions. This latent representation is often not comprehensible to humans, which is why deep learning techniques are often considered to be black boxes. In this paper, we present INK: Instance Neighbouring by using Knowledge, a novel technique to learn binary feature-based representations, which are comprehensible to humans, for nodes of interest in a knowledge graph. We demonstrate the predictive power of the node representations obtained through INK by feeding them to classical machine learning techniques and comparing their predictive performances for the node classification task to the current state of the art: Graph Convolutional Networks (R-GCN) and RDF2Vec. We perform this comparison both on benchmark datasets and using a real-world use case.
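A toy flavour of such comprehensible binary neighbourhood features (my simplification for illustration; the actual INK pipeline supports deeper neighbourhoods and richer feature construction) might look like:

```python
from collections import defaultdict

def ink_style_features(triples, nodes, depth=1):
    """Toy INK-flavoured representation: one binary feature per
    (predicate, object) pair seen within `depth` hops of a node of
    interest. Returns per-node bit vectors plus the feature vocabulary."""
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, o))
    per_node = {}
    for n in nodes:
        frontier, seen = [n], set()
        for _ in range(depth):               # expand the neighbourhood
            nxt = []
            for u in frontier:
                for p, o in adj[u]:
                    seen.add((p, o))
                    nxt.append(o)
            frontier = nxt
        per_node[n] = seen
    vocab = sorted({f for s in per_node.values() for f in s})
    return {n: [1 if f in per_node[n] else 0 for f in vocab] for n in nodes}, vocab
```

Each bit is directly readable ("has predicate p with object o nearby"), which is what makes the representation comprehensible to humans, in contrast to latent embeddings.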
When searching for information in a data collection, we are often interested not only in finding relevant items, but also in assembling a diverse set, so as to explore different concepts that are present in the data. This problem has been researched extensively. However, finding a set of items with minimal pairwise similarities can be computationally challenging, and most existing works striving for quality guarantees assume that item relatedness is measured by a distance function. Given the widespread use of similarity functions in many domains, we believe this to be an important gap in the literature. In this paper we study the problem of finding a diverse set of items, when item relatedness is measured by a similarity function. We formulate the diversification task using a flexible, broadly applicable minimization objective, consisting of the sum of pairwise similarities of the selected items and a relevance penalty term. To find good solutions we adopt a randomized rounding strategy, which is challenging to analyze because of the cardinality constraint present in our formulation. Even though this obstacle can be overcome using dependent rounding, we show that it is possible to obtain provably good solutions using an independent approach, which is faster, simpler to implement and completely parallelizable. Our analysis relies on a novel bound for the ratio of Poisson-Binomial densities, which is of independent interest and has potential implications for other combinatorial-optimization problems. We leverage this result to design an efficient randomized algorithm that provides a lower-order additive approximation guarantee. We validate our method using several benchmark datasets, and show that it consistently outperforms the greedy approaches that are commonly used in the literature.
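The independent rounding strategy itself is simple to sketch. The marginals would come from solving a relaxation (omitted here), and `objective` mirrors the stated minimization objective with a hypothetical penalty weight `lam`:

```python
import random

def independent_rounding(marginals, seed=0):
    """Round fractional marginals x_i (summing to roughly k) to a set by
    including each item independently with probability x_i. A sketch of
    the independent strategy described; the paper adds the analysis and
    additive approximation guarantee on top."""
    rng = random.Random(seed)
    return [i for i, x in enumerate(marginals) if rng.random() < x]

def objective(selected, sim, relevance, lam=1.0):
    """Sum of pairwise similarities of the selected items plus a
    relevance-penalty term (lam times the total relevance missed)."""
    pair = sum(sim[i][j] for a, i in enumerate(selected)
                         for j in selected[a + 1:])
    penalty = lam * sum(r for i, r in enumerate(relevance) if i not in selected)
    return pair + penalty
```

Independence is what makes the scheme trivially parallelizable: each item's coin flip needs no coordination, unlike dependent rounding under a hard cardinality constraint.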
A ubiquitous presence of sequence data across fields such as the web, healthcare, bioinformatics, and text mining has made sequence mining a vital research area. However, sequence mining is particularly challenging because of the absence of an accurate and fast approach for finding the (dis)similarity between sequences. As a measure of (dis)similarity, mainstream data mining methods like k-means, kNN, and regression have proved distance between data points in a Euclidean space to be most effective. But a distance measure between sequences is not obvious due to their unstructuredness: they are arbitrary strings of arbitrary length. We therefore propose a new function, called Sequence Graph Transform (SGT), that extracts sequence features and embeds them in a finite-dimensional Euclidean space. It is scalable due to its low computational complexity and is universally applicable to sequence problems. We theoretically show that SGT can capture both short- and long-range patterns in sequences and provides an accurate distance-based measure of (dis)similarity between them. This is also validated experimentally. Finally, we show its real-world application to clustering, classification, search, and visualization on different sequence problems.
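A simplified SGT-style feature map can be sketched as follows (the published SGT uses a somewhat different normalisation; here `kappa` is an assumed decay parameter controlling how fast long-range associations fade):

```python
import math
from collections import defaultdict

def sgt_features(seq, kappa=1.0):
    """Simplified SGT-style embedding: for every ordered symbol pair
    (u, v), sum exp(-kappa * gap) over all occurrences of u preceding v,
    normalised by the number of occurrences of u. A sketch of the idea,
    not the exact published transform."""
    counts = defaultdict(float)
    occ = defaultdict(int)
    for i, u in enumerate(seq):
        occ[u] += 1
        for j in range(i + 1, len(seq)):
            counts[(u, seq[j])] += math.exp(-kappa * (j - i))
    return {uv: w / occ[uv[0]] for uv, w in counts.items()}
```

The resulting dictionary is a fixed-size vector over the alphabet-pair space, so arbitrary-length sequences land in one finite-dimensional Euclidean space where ordinary distances apply.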
This paper deals with the problem of modeling counterfactual reasoning in scenarios where, apart from the observed endogenous variables, we have a latent variable that affects the outcomes and, consequently, the results of counterfactual queries. This is a common setup in healthcare problems, including mental health. We propose a new framework where the aforementioned problem is modeled as a multivariate regression and the counterfactual model accounts for both observed and a latent variable, where the latter represents what we call the patient individuality factor (φ). In mental health, focusing on individuals is paramount, as past experiences can change how people see or deal with situations, but individuality cannot be directly measured. To the best of our knowledge, this is the first counterfactual approach that considers both observational and latent variables to provide deterministic answers to counterfactual queries, such as: if I change the social support of a patient, to what extent can I change his/her anxiety? The framework combines concepts from deep representation learning and causal inference to infer the value of φ and capture both non-linear and multiplicative effects of causal variables. Experiments are performed with both synthetic and real-world datasets, where we predict how changes in people's actions may lead to different outcomes in terms of symptoms of mental illness and quality of life. Results show the model learns the individuality factor with errors lower than 0.05 and answers counterfactual queries that are supported by the medical literature. The model has the potential to recommend small changes in people's lives that may completely change their relationship with mental illness.
Through the quantification of physical activity energy expenditure (PAEE), health care monitoring has the potential to stimulate vital and healthy ageing, inducing behavioural changes in older people and linking these to personal health gains. To measure PAEE in a health care setting, methods based on wearable accelerometers have been developed; however, they mainly target younger people. Since elderly subjects differ in energy requirements and range of physical activities, the current models may not be suitable for estimating PAEE among the elderly. Furthermore, currently available methods seem to be either simple but non-generalizable or to require elaborate (manual) feature construction steps. Because past activities influence present PAEE, we propose a modeling approach known for its ability to model sequential data: the recurrent neural network (RNN). To train the RNN for an elderly population, we used the Growing Old Together Validation (GOTOV) dataset with 34 healthy participants of 60 years and older (mean age 65), performing 16 different activities. We used accelerometers placed on the wrist and ankle, and measurements of energy counts by means of indirect calorimetry. After optimization, we propose an architecture consisting of an RNN with 3 GRU layers and a feedforward network combining both accelerometer and participant-level data. Our efforts included switching from the mean to the standard deviation for down-sampling the input data and combining temporal and static data (person-specific details such as age, weight, and BMI). The resulting architecture produces accurate PAEE estimations while decreasing training input and time by a factor of 10. Furthermore, compared to the state of the art, it is capable of integrating longer activity data, which leads to more accurate estimations of the EE of low-intensity activities.
It can thus be employed to investigate associations of PAEE with vitality parameters of older people related to metabolic and cognitive health and mental well-being.
Estimated ball trajectory (yellow dots) compared to the measured data points from the tracking data. The video footage of the pass can be found here: (Color figure online)
Visualization of potential pass angles reaching the intended target. Players' movement vectors are displayed as arrows. The passing player (Robin Koch, #2 of the blue team) as well as the receiving player (Emre Can, #3 of the blue team) are highlighted in yellow. The same pass is displayed as in Fig. 1 and in this video: (Color figure online)
Estimated target of a pass with ball-trajectory and movement models. The combination of the estimated ball trajectory (yellow dots) and the player movement model (blue and red circles) predicts Julian Draxler (#7 of the blue team) reaching the ball first, with the arrow indicating the first point where he could potentially intercept the pass. The respective video sequence can be viewed here: (Color figure online)
Pass probabilities of hypothetical passes. This is the same situation as displayed and described in Fig. 3. The video sequence can be found here:
Feature influence on the xPass model (Table 3, row 1) based on SHAP values
Passes are by far football's (soccer) most frequent event, yet surprisingly little meaningful research has been devoted to quantifying them. With the increase in availability of so-called positional data, describing the positioning of players and ball at every moment of the game, our work aims to determine the difficulty of every pass by calculating its success probability based on its surrounding circumstances. As most experts will agree, not all passes are of equal difficulty; however, most traditional metrics count them as such. With our work we can quantify how well players execute passes, assess their risk profile, and even compute completion probabilities for hypothetical passes by combining physical and machine learning models. Our model uses the first 0.4 seconds of a ball trajectory and the movement vectors of all players to predict the intended target of a pass with an accuracy of 93.0% for successful and 72.0% for unsuccessful passes, much higher than any previously published work. Our extreme gradient boosting model can then quantify the likelihood of a successful pass completion towards the identified target with an area under the curve (AUC) of 93.4%. Finally, we discuss several potential applications, like player scouting or evaluating pass decisions.
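The interplay of ball trajectory and player movement can be illustrated with a deliberately crude interception model (straight-line running at constant maximum speed, far simpler than the movement model the paper combines with its physical ball model):

```python
def first_interception(ball_traj, players, dt=0.1):
    """Given a ball trajectory [(x, y), ...] sampled every dt seconds and
    players as {name: ((px, py), max_speed)}, return (name, t) for the
    player who can reach the ball earliest. Crude assumption: each player
    runs straight at constant max speed from a standing start."""
    best = None
    for t_idx, (bx, by) in enumerate(ball_traj):
        t = t_idx * dt
        for name, ((px, py), speed) in players.items():
            dist = ((bx - px) ** 2 + (by - py) ** 2) ** 0.5
            if dist <= speed * t:        # player can be at this point in time
                if best is None or t < best[1]:
                    best = (name, t)
        if best is not None:
            break                        # earliest reachable point found
    return best
```

Scanning the trajectory in time order means the first hit is the earliest possible interception, which is the quantity visualised by the arrow in the figure above.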
a A simple graph G that is stratified into four strata {I1, I2, I3, I4}. b–d The second, third and fourth graph strata constructed by Definition 5. In the multi-graph G2 (Fig. 1b), vertices in I1 are collapsed into ζ2 and only edges incident on I2 are preserved. The edge set therefore contains J2 and the edges between ζ2 and I2. Consequently, self-loops on ζ2 and edges between I3:4 are absent. c, d follow suit. In each stratum Gr, RWTs from ζr are started by sampling u.a.r. from the dotted edges and estimates are computed over the solid edges
Accuracy and convergence analysis for 5-CISs. We plot the L2-norm between the Ripple estimate and the exact value of the count vector C5 (Eq. 9) of all non-isomorphic subgraph patterns against various configurations of the parameters ϵ and |I1|. As expected, the accuracy improves as the error bound ϵ decreases and the number of seed subgraphs |I1| increases. Each box and whisker represents 10 runs
Convergence of Ripple estimates of 12-CIS pattern counts. We estimate the total number of subgraphs |V12| and the number of sparse patterns and stars. Estimates over 10 runs are presented as box-and-whisker plots, which exhibit a reduction in variance as ϵ increases. Indeed, almost all patterns are sparse, and the most frequent substructure is a star
Parallel RWTs and reservoirs: a The set of mRWTs sampled on G2 in parallel, where the supernode ζ2 is colored black. The gray, blue, red and green colors represent states in strata 2–5, respectively. b The upper-triangular reservoir matrix in which the cell in the r-th row and t-th column contains samples from Ûr,t
This work considers the general task of estimating the sum of a bounded function over the edges of a graph, given neighborhood query access and where access to the entire network is prohibitively expensive. To estimate this sum, prior work proposes Markov chain Monte Carlo (MCMC) methods that use random walks started at some seed vertex and whose equilibrium distribution is the uniform distribution over all edges, eliminating the need to iterate over all edges. Unfortunately, these existing estimators are not scalable to massive real-world graphs. In this paper, we introduce Ripple, an MCMC-based estimator that achieves unprecedented scalability by stratifying the Markov chain state space into ordered strata with a new technique that we denote sequential stratified regenerations. We show that the Ripple estimator is consistent, highly parallelizable, and scales well. We empirically evaluate our method by applying Ripple to the task of estimating connected, induced subgraph counts given some input graph. Therein, we demonstrate that Ripple is accurate and can estimate counts of up to 12-node subgraphs, a task at a scale that has been considered unreachable, not only by prior MCMC-based methods but also by other sampling approaches. For instance, in this target application, we present results in which the Markov chain state space is as large as 10^43, for which Ripple computes estimates in less than 4 h, on average.
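The baseline task, estimating an edge sum from a random walk whose traversed edges are asymptotically uniform over all edges, can be sketched as below; this is a minimal estimator of the kind Ripple improves upon, not Ripple itself, and it assumes |E| is known:

```python
import random

def rw_edge_sum(adj, f, n_edges, steps=10000, seed=0):
    """Estimate sum of a symmetric function f over the undirected edges of
    a connected graph via a simple random walk. The walk's stationary
    distribution over traversed edges is uniform, so the running mean of
    f times |E| converges to the edge sum (burn-in omitted for brevity)."""
    rng = random.Random(seed)
    v = next(iter(adj))          # arbitrary seed vertex
    total = 0.0
    for _ in range(steps):
        u = rng.choice(adj[v])   # walk one step along a uniform neighbor
        total += f(v, u)         # evaluate f on the traversed edge
        v = u
    return n_edges * total / steps
```

The scalability problem Ripple addresses arises when the implicit state space (e.g. all connected induced subgraphs) is astronomically large, not a small explicit adjacency list like this one.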
Illustration of the effect of the STC principle. Solid edges correspond to strong ties and dashed edges to weak ties. Observe (in red) the effect of strengthening tie (4, 5): there is an increased chance for ties (1, 5) and (3, 5) to be created. On the other hand, the STC principle does not stipulate the creation of tie (2, 5), as (2, 4) is weak (Color figure online)
Construction of graph H used in the proof of Lemma 1
High-level description of our algorithmic pipeline using the example of Fig. 1 and for k=2. On the left-hand side is the initial graph. Then, the graph is transformed into the corresponding wedge graph. Edge-vertices (1, 4) and (3, 4) are highlighted in black since they correspond to fixed vertices in the k-Densify instance (edges (1, 4) and (3, 4) are strong). The final step is to obtain the optimal k-Densify solution for k=2 (vertices in red) (Color figure online)
Performance comparison of all algorithms
Running time comparison of all algorithms
Online social networks provide a forum where people make new connections, learn more about the world, get exposed to different points of view, and access information that was previously inaccessible. It is natural to assume that content-delivery algorithms in social networks should not only aim to maximize user engagement but also to offer opportunities for increasing connectivity and enabling social networks to achieve their full potential. Our motivation and aim is to develop methods that foster the creation of new connections and, subsequently, improve the flow of information in the network. To achieve our goal, we propose to leverage the strong triadic closure principle and consider violations of this principle as opportunities for creating more social links. We formalize this idea as an algorithmic problem related to the densest k-subgraph problem. For this new problem, we establish hardness results and propose approximation algorithms. We identify two special cases of the problem that admit a constant-factor approximation. Finally, we experimentally evaluate our proposed algorithm on real-world social networks, and we additionally evaluate some simpler but more scalable algorithms.
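Enumerating STC violations, the open strong wedges that the abstract treats as link-creation opportunities, is straightforward to sketch (the edge sets in any concrete example are an assumption, loosely modelled on Fig. 1):

```python
from itertools import combinations

def stc_violations(strong, weak):
    """Return open strong wedges (v, u, w): u has strong ties to both v
    and w but no tie (strong or weak) connects v and w. Under the strong
    triadic closure principle these are natural link opportunities.
    Minimal sketch; the paper turns closing such wedges into a problem
    related to densest k-subgraph."""
    strong_nbrs = {}
    all_ties = set()
    for a, b in strong:
        strong_nbrs.setdefault(a, set()).add(b)
        strong_nbrs.setdefault(b, set()).add(a)
        all_ties.add(frozenset((a, b)))
    for a, b in weak:
        all_ties.add(frozenset((a, b)))
    open_wedges = []
    for u, nbrs in strong_nbrs.items():
        for v, w in combinations(sorted(nbrs), 2):
            if frozenset((v, w)) not in all_ties:
                open_wedges.append((v, u, w))
    return open_wedges
```

Note that wedges through a weak tie are deliberately excluded, matching the figure's point that a weak (2, 4) does not stipulate the creation of (2, 5).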
Discrete Markov chains are frequently used to analyse transition behaviour in sequential data. Here, the transition probabilities can be estimated using varying-order Markov chains, where order k specifies the length of the sequence history that is used to model these probabilities. Generally, such a model is fitted to the entire dataset, but in practice it is likely that some heterogeneity in the data exists and that some sequences would be better modelled with alternative parameter values, or with a Markov chain of a different order. We use the framework of Exceptional Model Mining (EMM) to discover these exceptionally behaving sequences. In particular, we propose an EMM model class that allows for discovering subgroups with transition behaviour of varying order. To that end, we propose three new quality measures based on information-theoretic scoring functions. Our findings from controlled experiments show that all three quality measures find exceptional transition behaviour of varying order and are reasonably sensitive. The quality measure based on Akaike's Information Criterion is most robust to the number of observations. We furthermore add to existing work by searching for subgroups of sequences, as opposed to subgroups of transitions. Since we use sequence-level descriptive attributes, we form subgroups of entire sequences, which is practically relevant in situations where one wants to identify the originators of exceptional sequences, such as patients. We show this relevance by analysing sequences of blood glucose values of adults with type 2 diabetes. In the experiments, we find subgroups of patients based on age and glycated haemoglobin (HbA1c), a measure known to correlate with average blood glucose values. Clinicians and domain experts confirmed the transition behaviour as estimated by the fitted Markov chain models.
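The idea behind AIC-based order selection can be illustrated by fitting Markov chains of increasing order and keeping the order with minimal AIC (a sketch of global order selection only; the paper's quality measure scores subgroups against the whole dataset rather than fitting one chain):

```python
import math
from collections import Counter

def fit_markov_loglik(seq, order, alphabet):
    """Maximum-likelihood log-likelihood of an order-k Markov chain
    fitted to seq, using contexts of length k."""
    ctx, trans = Counter(), Counter()
    for i in range(order, len(seq)):
        c = tuple(seq[i - order:i])
        ctx[c] += 1
        trans[(c, seq[i])] += 1
    return sum(n * math.log(n / ctx[c]) for (c, _), n in trans.items())

def best_order_by_aic(seq, alphabet, max_order=3):
    """Pick the order minimising AIC = 2p - 2LL, where an order-k chain
    over alphabet A has p = |A|^k * (|A| - 1) free parameters."""
    best = None
    for k in range(max_order + 1):
        ll = fit_markov_loglik(seq, k, alphabet)
        p = len(alphabet) ** k * (len(alphabet) - 1)
        aic = 2 * p - 2 * ll
        if best is None or aic < best[1]:
            best = (k, aic)
    return best[0]
```

The parameter penalty is what keeps higher orders from winning by default: a perfectly alternating sequence is explained exactly by order 1, and orders 2 and 3 fit no better while paying for more parameters.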
Temporal graphs are structures which model relational data between entities that change over time. Due to the complex structure of the data, mining statistically significant temporal subgraphs, also known as temporal motifs, is a challenging task. In this work, we present an efficient technique for extracting temporal motifs in temporal networks. Our method is based on the novel notion of egocentric temporal neighborhoods, namely multi-layer structures centered on an ego node. Each temporal layer of the structure consists of the first-order neighborhood of the ego node, and corresponding nodes in sequential layers are connected by an edge. The strength of this approach lies in the possibility of encoding these structures into a unique bit vector, thus bypassing the problem of graph isomorphism in searching for temporal motifs. This allows our algorithm to mine substantially larger motifs than alternative approaches. Furthermore, by focusing on the temporal dynamics of the interactions of a specific node, our model makes it possible to mine temporal motifs that are readily interpretable. Experiments on a number of complex networks of social interactions confirm the advantage of the proposed approach over alternative non-egocentric solutions. The egocentric procedure is indeed more efficient in revealing similarities and discrepancies among different social environments, independently of the different technologies used to collect the data, which instead affect standard non-egocentric measures.
Many systems can be expressed as multivariate state sequences (MSSs) in terms of entities and their states, with dependencies that evolve over time. In order to interpret the temporal dynamics in such data, it is essential to capture relationships between entities and their changes in state and dependence over time under uncertainty. Existing probabilistic models do not explicitly model the evolution of causality between dependent state sequences and mostly result in complex structures when representing complete causal dependencies between random variables. To solve this, Temporal State Change Bayesian Networks (TSCBNs) are introduced to effectively model interval relations of MSSs under evolving uncertainty. Our model outperforms competing approaches in terms of parameter complexity and expressiveness. Further, an efficient structure discovery method for TSCBNs is presented that improves classical approaches by exploiting temporal knowledge, and multiple parameter estimation approaches for TSCBNs are introduced. These are expectation maximization, variational inference, and a sampling-based maximum likelihood estimation, which allow parameters to be learned from partially observed MSSs. Lastly, we demonstrate how TSCBNs allow captured sequences to be interpreted and their patterns inferred for specification mining in the automotive domain.
A bipartite graph and its two (i, j)-cores, where circles and squares indicate the left and right vertices, respectively, and the dashed lines represent the (2, 2)-core and the (3, 2)-core. In general, the (3, 2)-core has a greater density than the (2, 2)-core, where density can be computed as |E(G)|/(|L(G)|·|R(G)|). The density of a complete bipartite graph is 1
The direct relation graph of (i, j)-cores in a complete graph with 3 left vertices and 3 right vertices
The steps for updating the (2, 2)-core with two inserted edges are as follows: a the previous (2, 2)-core is marked by a black dashed line and the two inserted edges are specified by a green dotted line; b the quasi-(2, 2)-core is shown by a purple half-dashed line; c the candidate graph is indicated by a blue half-dashed line; d the partial-(2, 2)-core is indicated by a red half-dashed line; e the current (2, 2)-core is indicated by a dashed line (Color figure online)
The steps for updating the (2, 2)-core with two removed edges are as follows: a the previous (2, 2)-core is indicated by a black dashed line, and the two removed edges are noted by a green dotted line; b the quasi-(2, 2)-core is shown by a purple half-dashed line; c the affected vertices and edges are indicated by a red half-dashed line; d the current (2, 2)-core is indicated by a black dashed line (Color figure online)
The steps for updating the (2, 2)-core using the current (3, 2)-core and two inserted edges are as follows: a the previous (2, 2)-core equals the current (3, 2)-core, both marked by a black dash-dotted line, with the two inserted edges marked by a green dotted line; b the quasi-(2, 2)-core is indicated by a purple half-dashed line; c the candidate graph is indicated by a blue half-dashed line; d the partial-(2, 2)-core is indicated by a red half-dashed line; e the current (2, 2)-core is indicated by a black dashed line and a dash-dotted line (Color figure online)
The k-core is important in many graph mining applications, such as community detection and clique finding. As a generalization of the k-core, the (i, j)-core is better suited for bipartite graph analysis since it can distinguish the roles of the two different types of vertices. Because (i, j)-cores evolve as edges are inserted into (or removed from) a dynamic bipartite graph, it is more economical to maintain them than to recursively decompose the graph when only a few edges change. Moreover, many applications (e.g., graph visualization) focus only on some dense (i, j)-cores. Existing solutions are not adaptable enough: they must maintain all (i, j)-cores rather than just a subset of them, which wastes effort. To solve this issue, we propose novel maintenance methods for updating the desired (i, j)-cores. To estimate the influence scope of inserted (or removed) edges, we first construct quasi-(i, j)-cores, which relax the constraints of (i, j)-cores but have similar properties. Second, we present a bottom-up approach for efficiently maintaining all (i, j)-cores, from sparse to dense. Third, because certain applications focus only on the dense (i, j)-cores of the top-n layers, we also propose a top-down approach that maintains (i, j)-cores from dense to sparse. Finally, we conduct extensive experiments to validate the efficiency of the proposed approaches. Experimental results show that our maintenance solutions outperform existing approaches by one order of magnitude.
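For context on what is being maintained: the (i, j)-core is the maximal subgraph of a bipartite graph in which every left vertex has degree at least i and every right vertex has degree at least j. A hedged from-scratch sketch by iterative peeling follows; this recomputation on every change is exactly what the maintenance methods above are designed to avoid, and the function name is hypothetical.

```python
def ij_core(edges, i, j):
    """Extract the (i, j)-core of a bipartite graph by iterative peeling:
    repeatedly drop edges incident to left vertices of degree < i or
    right vertices of degree < j, until the subgraph is stable."""
    edges = set(edges)  # each edge is a (left_vertex, right_vertex) pair
    changed = True
    while changed:
        ldeg, rdeg = {}, {}
        for u, v in edges:
            ldeg[u] = ldeg.get(u, 0) + 1
            rdeg[v] = rdeg.get(v, 0) + 1
        kept = {(u, v) for u, v in edges if ldeg[u] >= i and rdeg[v] >= j}
        changed = kept != edges
        edges = kept
    return edges
```

For example, on a K(2,2) plus one pendant edge ("c", "x"), the (2, 2)-core peels away the pendant edge and keeps the K(2,2), while the (3, 2)-core is empty.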
Data-to-Text Generation (DTG) is a subfield of Natural Language Generation that aims to transcribe structured data into natural language descriptions. The field has recently been boosted by the use of neural generators, which on one side exhibit great syntactic skill without the need for hand-crafted pipelines; on the other side, the quality of the generated text reflects the quality of the training data, which in realistic settings only offers imperfectly aligned structure-text pairs. Consequently, state-of-the-art neural models include misleading statements, usually called hallucinations, in their outputs. Controlling this phenomenon is today a major challenge for DTG, and it is the problem addressed in this paper. Previous work deals with this issue at the instance level, using an alignment score for each table-reference pair. In contrast, we propose a finer-grained approach, arguing that hallucinations should instead be treated at the word level. Specifically, we propose a Multi-Branch Decoder that is able to leverage word-level labels to learn the relevant parts of each training instance. These labels are obtained by a simple and efficient scoring procedure based on co-occurrence analysis and dependency parsing. Extensive evaluations, via automated metrics and human judgment on the standard WikiBio benchmark, show the accuracy of our alignment labels and the effectiveness of the proposed Multi-Branch Decoder. Our model is able to reduce and control hallucinations while keeping fluency and coherence in the generated texts. Further experiments on a degraded version of ToTTo show that our model can be successfully used in very noisy settings.
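To illustrate the word-level view (as opposed to a single instance-level alignment score), here is a deliberately crude sketch: each reference word is labeled as table-supported or as a potential hallucination by simple lexical matching against the table values. This is not the paper's procedure, which relies on co-occurrence statistics and dependency parsing; the function name and labeling rule are assumptions for illustration.

```python
def word_alignment_labels(table_values, reference):
    """Label each word of `reference` with 1 if it appears in the table
    values (table-supported) or 0 otherwise (potential hallucination).
    A crude lexical stand-in for a word-level alignment procedure."""
    vocab = {w.lower() for value in table_values for w in value.split()}
    return [(w, 1 if w.lower().strip(".,") in vocab else 0)
            for w in reference.split()]
```

On a WikiBio-style pair, a table with values "John Smith" and "1953" against the reference "John Smith born 1953 in Paris" labels "Paris" as unsupported, precisely the kind of word-level signal a multi-branch decoder could down-weight during training.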
Top-cited authors
Anthony Bagnall
  • University of East Anglia
Germain Forestier
  • Université de Haute-Alsace
Jonathan Weber
  • Université de Haute-Alsace
Lhassane Idoumghar
  • Université de Haute-Alsace
Pierre-Alain Muller
  • Université de Haute-Alsace