Conference Paper

Redescription Model Mining

Article
Full-text available
Transfer learning aims at improving the performance of target learners on target domains by transferring the knowledge contained in different but related source domains. In this way, the dependence on a large number of target-domain data can be reduced for constructing target learners. Due to the wide application prospects, transfer learning has become a popular and promising area in machine learning. Although there are already some valuable and impressive surveys on transfer learning, these surveys introduce approaches in a relatively isolated way and lack the recent advances in transfer learning. Due to the rapid expansion of the transfer learning area, it is both necessary and challenging to comprehensively review the relevant studies. This survey attempts to connect and systematize the existing transfer learning research studies, as well as to summarize and interpret the mechanisms and the strategies of transfer learning in a comprehensive way, which may help readers have a better understanding of the current research status and ideas. Unlike previous surveys, this survey article reviews more than 40 representative transfer learning approaches, especially homogeneous transfer learning approaches, from the perspectives of data and model. The applications of transfer learning are also briefly introduced. In order to show the performance of different transfer learning models, over 20 representative transfer learning models are used for experiments. The models are evaluated on three data sets: Amazon Reviews, Reuters-21578, and Office-31. The experimental results demonstrate the importance of selecting appropriate transfer learning models for different applications in practice.
Article
Full-text available
In this work, we present a redescription mining algorithm that uses Random Forests of Predictive Clustering Trees (RFPCTs) for generating and iteratively improving a set of redescriptions. The approach uses information about element membership in different queries, generated from a single constructed PCT, to explore the redescription space, while queries obtained from the Random Forest of PCTs increase candidate diversity. The approach is able to produce highly accurate, statistically significant redescriptions described by Boolean, nominal or numerical attributes. As opposed to current tree-based approaches that use multi-class or binary classification, we explore the benefits of using multi-label classification and multi-target regression to create redescriptions. A major benefit of the approach, compared to other state-of-the-art solutions, is that it does not require specifying a minimal threshold on redescription accuracy to obtain a highly accurate, optimized set of redescriptions. The process of Random-Forest-based augmentation and the different modes of redescription set creation are evaluated on three datasets with different properties. We use the same datasets to compare the performance of our algorithm to state-of-the-art redescription mining approaches.
Article
Full-text available
Exceptional Model Mining strives to find coherent subgroups of the dataset where multiple target attributes interact in an unusual way. One instance of such an investigated form of interaction is Pearson’s correlation coefficient between two targets. EMM then finds subgroups with an exceptionally linear relation between the targets. In this paper, we enrich the EMM toolbox by developing the more general rank correlation model class. We find subgroups with an exceptionally monotone relation between the targets. Apart from catering for this richer set of relations, the rank correlation model class does not necessarily require the assumption of target normality, which is implicitly invoked in the Pearson’s correlation model class. Furthermore, it is less sensitive to outliers. We provide pseudocode for the employed algorithm and analyze its computational complexity, and experimentally illustrate what the rank correlation model class for EMM can find for you on six datasets from an eclectic variety of domains.
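The rank correlation model class described above can be sketched in a few lines. This is an illustrative reading of the idea, not the paper's algorithm: quantify a subgroup's exceptionality as the absolute difference between Spearman's rank correlation of the two targets inside the subgroup and over the whole dataset. All names are hypothetical.

```python
# Illustrative sketch of a rank-correlation quality measure for EMM.
# Quality = |rho(subgroup) - rho(whole dataset)|; pure stdlib.

def ranks(values):
    """Assign average ranks to values (ties share the mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

def rank_quality(x, y, member):
    """EMM-style quality for a subgroup given by a membership mask."""
    sx = [a for a, m in zip(x, member) if m]
    sy = [b for b, m in zip(y, member) if m]
    return abs(spearman(sx, sy) - spearman(x, y))
```

Because only ranks enter the computation, the measure is invariant under monotone transformations of the targets, which is what makes it less sensitive to outliers than Pearson's r.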
Article
Full-text available
Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
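The cover-based redundancy elimination described above can be illustrated with a toy greedy selector. This is a hypothetical sketch in the spirit of DSSD's weighted-covering selection, not the published algorithm: each candidate's quality is discounted by how often its rows are already covered by subgroups chosen earlier.

```python
# Hypothetical sketch of cover-based diverse subgroup selection:
# greedily pick high-quality subgroups, multiplicatively discounting
# rows that previously chosen subgroups already cover.

def select_diverse(candidates, k, alpha=0.9):
    """candidates: list of (quality, frozenset_of_row_ids).
    Returns up to k subgroups, trading quality against cover overlap."""
    counts = {}                      # how often each row is covered so far
    chosen = []
    pool = list(candidates)
    for _ in range(min(k, len(pool))):
        best, best_score = None, float("-inf")
        for q, cover in pool:
            if not cover:
                continue
            # average per-row discount: alpha^(times already covered)
            w = sum(alpha ** counts.get(r, 0) for r in cover) / len(cover)
            if q * w > best_score:
                best, best_score = (q, cover), q * w
        if best is None:
            break
        chosen.append(best)
        pool.remove(best)
        for r in best[1]:
            counts[r] = counts.get(r, 0) + 1
    return chosen
```

With a small alpha, a slightly weaker subgroup over fresh rows beats a stronger near-duplicate, which is exactly the exploration/exploitation balance the abstract describes.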
Article
Full-text available
Exceptional Model Mining (EMM) is an exploratory data analysis technique that can be regarded as a generalization of subgroup discovery. In EMM we look for subgroups of the data for which a model fitted to the subgroup differs substantially from the same model fitted to the entire dataset. In this paper we develop methods to mine for exceptional regression models. We propose a measure for the exceptionality of regression models (Cook's distance), and explore the possibilities to avoid having to fit the regression model to each candidate subgroup. The algorithm is evaluated on a number of real life datasets. These datasets are also used to illustrate the results of the algorithm. We find interesting subgroups with deviating models on datasets from several different domains. We also show that under certain circumstances one can forego fitting regression models on up to 40% of the subgroups, and these 40% are the relatively expensive regression models to compute.
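The Cook's-distance idea above can be sketched for a simple y = b0 + b1*x regression: measure how far the subgroup's coefficient vector lies from the global one, weighted by X'X and scaled by the global error variance. This is an illustrative reconstruction under those assumptions, not the paper's exact procedure.

```python
# Hedged sketch: Cook's-distance-style exceptionality of a subgroup's
# regression model, D = delta' (X'X) delta / (p * s^2), for simple OLS.

def fit(xs, ys):
    """Ordinary least squares for y = b0 + b1*x (xs must vary)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b1 * mx, b1

def cooks_like_distance(xs, ys, member):
    b0, b1 = fit(xs, ys)                               # global model
    sx = [x for x, m in zip(xs, member) if m]
    sy = [y for y, m in zip(ys, member) if m]
    c0, c1 = fit(sx, sy)                               # subgroup model
    n, p = len(xs), 2
    resid = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    s2 = sum(r * r for r in resid) / max(n - p, 1)     # global error variance
    d0, d1 = c0 - b0, c1 - b1
    # delta' (X'X) delta with design rows [1, x_i]:
    quad = (d0 * d0 * n + 2 * d0 * d1 * sum(xs)
            + d1 * d1 * sum(x * x for x in xs))
    return quad / (p * s2) if s2 > 0 else float("inf")
```

A subgroup whose fitted line coincides with the global one scores zero; the measure grows with both the coefficient shift and the leverage of the design.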
Conference Paper
Full-text available
Exceptional model mining has been proposed as a variant of subgroup discovery especially focusing on complex target concepts. Currently, efficient mining algorithms are limited to heuristic (non-exhaustive) methods. In this paper, we propose a novel approach for fast exhaustive exceptional model mining: we introduce the concept of valuation bases as an intermediate condensed data representation, and present the general GP-growth algorithm based on FP-growth. Furthermore, we discuss the scope of the proposed approach by drawing an analogy to data stream mining and provide examples for several different model classes. Runtime experiments show improvements of more than an order of magnitude in comparison to a naive exhaustive depth-first search.
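The valuation-basis idea can be made concrete with the simplest model class: for a mean model over a numeric target, the condensed representation (count, sum) combines additively, so per-node aggregates in an FP-tree-like structure suffice to evaluate any pattern without rescanning the data. The class below is an illustration of that property, not the GP-growth implementation.

```python
# Illustrative valuation basis for a mean model: (count, sum) summaries
# absorb single records and merge with each other, which is the algebraic
# property GP-growth-style condensation relies on.

class MeanBasis:
    def __init__(self, count=0, total=0.0):
        self.count, self.total = count, total

    def add(self, y):
        """Absorb one data record's target value."""
        return MeanBasis(self.count + 1, self.total + y)

    def combine(self, other):
        """Merge two condensed summaries (associative and commutative)."""
        return MeanBasis(self.count + other.count, self.total + other.total)

    def mean(self):
        return self.total / self.count
```

Richer model classes need richer bases (e.g. sums of squares and cross-products for variance or correlation models), but the combine-without-raw-data contract stays the same.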
Article
Full-text available
Data mining algorithms look for patterns in data. While most existing data mining approaches look for patterns in a single data table, multi-relational data mining (MRDM) approaches look for patterns that involve multiple tables (relations) from a relational database. In recent years, the most common types of patterns and approaches considered in data mining have been extended to the multi-relational case and MRDM now encompasses multi-relational (MR) association rule discovery, MR decision trees and MR distance-based methods, among others. MRDM approaches have been successfully applied to a number of problems in a variety of areas, most notably in the area of bioinformatics. This article provides a brief introduction to MRDM, while the remainder of this special issue treats in detail advanced research topics at the frontiers of MRDM.
Article
Full-text available
Subgroup discovery is a data mining technique which extracts interesting rules with respect to a target variable. An important characteristic of this task is the combination of predictive and descriptive induction. An overview related to the task of subgroup discovery is presented. This review focuses on the foundations, algorithms, and advanced studies, together with the applications of subgroup discovery presented throughout the specialised bibliography. Keywords: Subgroup discovery, Knowledge discovery
Conference Paper
Full-text available
Whenever a dataset has multiple discrete target variables, we want our algorithms to consider not only the variables themselves, but also the interdependencies between them. We propose to use these interdependencies to quantify the quality of subgroups, by integrating Bayesian networks with the Exceptional Model Mining framework. Within this framework, candidate subgroups are generated. For each candidate, we fit a Bayesian network on the target variables. Then we compare the network's structure to the structure of the Bayesian network fitted on the whole dataset. To perform this comparison, we define an edit distance-based distance metric that is appropriate for Bayesian networks. We show interesting subgroups that we experimentally found with our method on datasets from music theory, semantic scene classification, biology and zoogeography.
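The edit-distance comparison of network structures can be sketched directly on edge sets. This is a hypothetical simplification of the paper's metric: count the edge additions, deletions, and reversals needed to turn one DAG's edge set into the other's, charging a reversal as a single edit.

```python
# Hypothetical edit-distance-style measure between two Bayesian network
# structures, represented as sets of directed edges (parent, child).

def structure_distance(edges_a, edges_b):
    a, b = set(edges_a), set(edges_b)
    # edges present in a whose reversal is present in b: one edit each
    reversals = {(u, v) for (u, v) in a - b if (v, u) in b}
    only_a = (a - b) - reversals                       # deletions
    only_b = (b - a) - {(v, u) for (u, v) in reversals}  # additions
    return len(reversals) + len(only_a) + len(only_b)
```

In the EMM setting, a large distance between the network fitted on a candidate subgroup and the one fitted on the whole dataset signals exceptional interdependencies among the targets.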
Conference Paper
Full-text available
In most databases, it is possible to identify small partitions of the data where the observed distribution is notably different from that of the database as a whole. In classical subgroup discovery, one considers the distribution of a single nominal attribute, and exceptional subgroups show a surprising increase in the occurrence of one of its values. In this paper, we introduce Exceptional Model Mining (EMM), a framework that allows for more complicated target concepts. Rather than finding subgroups based on the distribution of a single target attribute, EMM finds subgroups where a model fitted to that subgroup is somehow exceptional. We discuss regression as well as classification models, and define quality measures that determine how exceptional a given model on a subgroup is. Our framework is general enough to be applied to many types of models, even from other paradigms such as association analysis and graphical modeling.
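One quality measure from the correlation model class mentioned above can be sketched as follows: compare Pearson's r on the subgroup against r on its complement via the Fisher z-transformation. The exact formula and names here are illustrative assumptions, not copied from the paper.

```python
import math

# Hedged sketch of a correlation-model quality measure: a z-score for
# the difference between the subgroup's and the complement's Pearson r.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def fisher_z(r):
    """Fisher z-transformation; variance-stabilizes the correlation."""
    return 0.5 * math.log((1 + r) / (1 - r))

def correlation_quality(x, y, member):
    """Requires >3 rows on each side and |r| < 1 on both sides."""
    pairs = list(zip(x, y))
    sub = [p for p, m in zip(pairs, member) if m]
    comp = [p for p, m in zip(pairs, member) if not m]
    n1, n2 = len(sub), len(comp)
    r1 = pearson([a for a, _ in sub], [b for _, b in sub])
    r2 = pearson([a for a, _ in comp], [b for _, b in comp])
    z = (fisher_z(r1) - fisher_z(r2)) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return abs(z)
```

A high score means the linear association between the two targets inside the subgroup differs markedly from the association outside it.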
Article
Full-text available
Interestingness measures play an important role in data mining, regardless of the kind of patterns being mined. These measures are intended for selecting and ranking patterns according to their potential interest to the user. Good measures also allow the time and space costs of the mining process to be reduced. This survey reviews the interestingness measures for rules and summaries, classifies them from several perspectives, compares their properties, identifies their roles in the data mining process, gives strategies for selecting appropriate measures for applications, and identifies opportunities for future research in this area.
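Three of the classical rule interestingness measures surveyed above (support, confidence, lift) are easy to state concretely. The helper below is a small illustration over made-up market-basket data; the names are not from the survey.

```python
# Small illustration of support, confidence, and lift for an association
# rule A -> B over a list of transactions (each transaction is a set).

def rule_measures(transactions, a, b):
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)        # rows matching A
    n_b = sum(1 for t in transactions if b <= t)        # rows matching B
    n_ab = sum(1 for t in transactions if (a | b) <= t)  # rows matching both
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)   # >1: A and B co-occur more than expected
    return support, confidence, lift
```

Ranking rules by different measures yields different orderings, which is precisely why the survey's property-based comparison matters when choosing a measure for an application.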
Chapter
We strive to find contexts (i.e., subgroups of entities) under which exceptional (dis-)agreement occurs among a group of individuals, in any type of data featuring individuals (e.g., parliamentarians, customers) performing observable actions (e.g., votes, ratings) on entities (e.g., legislative procedures, movies). To this end, we introduce the problem of discovering statistically significant exceptional contextual intra-group agreement patterns. To handle the sparsity inherent to voting and rating data, we use Krippendorff's Alpha measure for assessing the agreement among individuals. We devise a branch-and-bound algorithm, named DEvIANT, to discover such patterns. DEvIANT exploits both closure operators and tight optimistic estimates. We derive analytic approximations for the confidence intervals (CIs) associated with patterns for a computationally efficient significance assessment. We prove that these approximate CIs are nested along specialization of patterns. This allows us to incorporate pruning properties in DEvIANT to quickly discard non-significant patterns. An empirical study on several datasets demonstrates the efficiency and the usefulness of DEvIANT.
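The agreement measure the abstract relies on can be computed for nominal data via the standard coincidence-matrix formulation of Krippendorff's alpha. The sketch below mirrors that textbook computation, not the DEvIANT algorithm itself; it tolerates the sparsity mentioned above because units with fewer than two ratings are simply skipped.

```python
from collections import Counter
from itertools import permutations

# Hedged sketch: Krippendorff's alpha for nominal data,
# alpha = 1 - (n - 1) * observed_disagreement / expected_disagreement.

def krippendorff_alpha_nominal(units):
    """units: list of rating lists (one inner list per entity);
    units with fewer than two ratings are ignored."""
    o = Counter()                      # coincidence matrix o[(c, k)]
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        for i, j in permutations(range(m), 2):
            o[(ratings[i], ratings[j])] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _k), w in o.items():
        n_c[c] += w                    # marginal totals per category
    n = sum(n_c.values())
    disagree = sum(w for (c, k), w in o.items() if c != k)
    expected = sum(n_c[c] * n_c[k] for c in n_c for k in n_c if c != k)
    if expected == 0:
        return 1.0                     # only one category ever observed
    return 1.0 - (n - 1) * disagree / expected
```

Alpha is 1 under perfect agreement, around 0 when agreement matches chance, and negative when individuals systematically disagree.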
Chapter
In scientific investigations, data oftentimes differ in nature; for instance, they might originate from distinct sources or be cast over separate terminologies. In order to gain insight into the phenomenon of interest, an intuitive first task is to identify the correspondences that exist between these different aspects. This is the motivating principle behind redescription mining, a data analysis task that aims at finding distinct common characterizations of the same objects. In this chapter, we provide the basic definitions of redescription mining, including the data model, query languages, similarity measures, p-value calculations, and methods for pruning redundant redescriptions. We will also briefly cover related data analysis methods and provide a short history of redescription mining research.
Conference Paper
We present a new method for detecting interpretable subgroups with exceptional transition behavior in sequential data. Identifying such patterns has many potential applications, e.g., for studying human mobility or analyzing the behavior of internet users. To tackle this task, we employ exceptional model mining, which is a general approach for identifying interpretable data subsets that exhibit unusual interactions between a set of target attributes with respect to a certain model class. Although exceptional model mining provides a well-suited framework for our problem, previously investigated model classes cannot capture transition behavior. To that end, we introduce first-order Markov chains as a novel model class for exceptional model mining and present a new interestingness measure that quantifies the exceptionality of transition subgroups. The measure compares the distance between the Markov transition matrix of a subgroup and the respective matrix of the entire data with the distance of random dataset samples. In addition, our method can be adapted to find subgroups that match or contradict given transition hypotheses. We demonstrate that our method is consistently able to recover subgroups with exceptional transition models from synthetic data and illustrate its potential in two application examples. Our work is relevant for researchers and practitioners interested in detecting exceptional transition behavior in sequential data.
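The core comparison described above can be sketched by estimating first-order Markov transition matrices from sequences and measuring how far the subgroup's matrix lies from the matrix of all data. The distance used below (total variation, averaged over states) is an illustrative stand-in for the paper's interestingness measure.

```python
from collections import Counter, defaultdict

# Illustrative sketch: Markov transition matrices for EMM on sequences.

def transition_matrix(sequences, states):
    """Row-stochastic matrix of first-order transition probabilities;
    states with no observed outgoing transitions get a uniform row."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    matrix = {}
    for s in states:
        total = sum(counts[s].values())
        matrix[s] = {t: (counts[s][t] / total if total else 1 / len(states))
                     for t in states}
    return matrix

def tv_distance(m1, m2, states):
    """Total variation distance between rows, averaged over states."""
    return sum(0.5 * sum(abs(m1[s][t] - m2[s][t]) for t in states)
               for s in states) / len(states)
```

A subgroup whose sequences transition between states very differently from the dataset as a whole yields a large distance and is therefore a candidate for an exceptional transition model.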
Article
Finding subsets of a dataset that somehow deviate from the norm, i.e. where something interesting is going on, is a classical Data Mining task. In traditional local pattern mining methods, such deviations are measured in terms of a relatively high occurrence (frequent itemset mining), or an unusual distribution for one designated target attribute (common use of subgroup discovery). These, however, do not encompass all forms of “interesting”. To capture a more general notion of interestingness in subsets of a dataset, we develop Exceptional Model Mining (EMM). This is a supervised local pattern mining framework, where several target attributes are selected, and a model over these targets is chosen to be the target concept. Then, we strive to find subgroups: subsets of the dataset that can be described by a few conditions on single attributes. Such subgroups are deemed interesting when the model over the targets on the subgroup is substantially different from the model on the whole dataset. For instance, we can find subgroups where two target attributes have an unusual correlation, a classifier has a deviating predictive performance, or a Bayesian network fitted on several target attributes has an exceptional structure. We give an algorithmic solution for the EMM framework, and analyze its computational complexity. We also discuss some illustrative applications of EMM instances, including using the Bayesian network model to identify meteorological conditions under which food chains are displaced, and using a regression model to find the subset of households in the Chinese province of Hunan that do not follow the general economic law of demand.
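The algorithmic solution the abstract alludes to is typically a level-wise beam search: each level refines the current beam with one extra single-attribute condition and keeps the w best descriptions. The toy version below takes the quality function as a parameter, so any of the model classes above could plug in; all names and defaults are illustrative.

```python
# Toy level-wise beam search over single-attribute conditions, as used
# in EMM / subgroup discovery. Everything here is an illustration.

def beam_search(rows, conditions, quality, width=3, depth=2):
    """rows: list of dicts; conditions: list of (name, predicate);
    quality: function on a list of rows. Returns [(quality, [names])]
    sorted best-first over all explored levels."""
    beam = [(quality(rows), [], rows)]
    results = []
    for _ in range(depth):
        candidates = []
        for _, desc, cover in beam:
            for name, pred in conditions:
                if name in desc:
                    continue                      # avoid repeating a condition
                sub = [r for r in cover if pred(r)]
                if sub:
                    candidates.append((quality(sub), desc + [name], sub))
        candidates.sort(key=lambda c: -c[0])
        beam = candidates[:width]                 # keep the w best refinements
        results.extend(beam)
    return sorted(((q, d) for q, d, _ in results), key=lambda c: -c[0])
```

The beam width trades completeness for speed: width 1 degenerates to greedy hill-climbing, while a very large width approaches exhaustive level-wise search.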
Conference Paper
For the task of subgroup discovery, generalization-aware interestingness measures, which are based not only on the statistics of the patterns themselves but also on the statistics of their generalizations, have recently been shown to be essential. A key technique for increasing the runtime performance of subgroup discovery algorithms is the application of optimistic estimates to limit the search space size. These are upper bounds for the interestingness that any specialization of the currently evaluated pattern may have. Until now, these estimates have been based on the anti-monotonicity of the instances covered by the current pattern, which neglects important properties of generalizations. Therefore, in this paper we present a new scheme for deriving optimistic estimates for generalization-aware subgroup discovery, based on the instances by which patterns differ from their generalizations. We show how this technique can be applied to the most popular interestingness measures for binary as well as numeric target concepts. The novel bounds are incorporated into an efficient algorithm, which outperforms previous methods by up to an order of magnitude.
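The pruning principle behind optimistic estimates can be illustrated with the classic, instance-based variant for weighted relative accuracy (WRAcc) on a binary target; the paper's generalization-aware bounds refine this idea. The sketch below is that textbook baseline, not the paper's scheme: no specialization of a pattern can beat the bound obtained by keeping all of the pattern's positives and none of its negatives.

```python
# Hedged sketch: classic optimistic-estimate pruning for WRAcc with a
# binary target; tp/fp are the pattern's covered positives/negatives,
# pos/neg the dataset totals.

def wracc(tp, fp, pos, neg):
    n, total = tp + fp, pos + neg
    return (n / total) * (tp / n - pos / total) if n else 0.0

def optimistic_estimate(tp, pos, neg):
    """Best conceivable specialization: keeps all tp positives, drops
    every negative -> WRAcc = (tp/total) * (1 - pos/total)."""
    total = pos + neg
    return (tp / total) * (1 - pos / total)

def can_prune(tp, pos, neg, best_so_far):
    """Skip the whole subtree if even the best case cannot improve."""
    return optimistic_estimate(tp, pos, neg) <= best_so_far
```

Since specializations only shrink the cover, tp never grows below a pattern, so the bound is anti-monotone and safe to use for discarding entire branches of the search.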
Conference Paper
Redescription mining is a powerful data analysis tool that is used to find multiple descriptions of the same entities. Consider geographical regions as an example. They can be characterized by the fauna that inhabits them on one hand and by their meteorological conditions on the other hand. Finding such redescriptions, a task known as niche-finding, is of much importance in biology. Current redescription mining methods can handle only Boolean data. This restricts the range of possible applications or makes discretization a pre-requisite, entailing a possibly harmful loss of information. In niche-finding, while the fauna can be naturally represented using Boolean presence/absence data, the weather cannot. In this paper, we extend redescription mining to categorical and real-valued data with possibly missing values using a surprisingly simple and efficient approach. We provide extensive experimental evaluation to study the behavior of the proposed algorithm. Furthermore, we show the statistical significance of our results using recent innovations on randomization methods. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012
Conference Paper
Given a 0-1 dataset, we consider the redescription mining task introduced by Ramakrishnan, Parida, and Zaki. The problem is to find subsets of the rows that can be (approximately) defined by at least two different Boolean formulae on the attributes. That is, we search for pairs (α, β) of Boolean formulae such that the implications α → β and β → α both hold with high accuracy. We require that the two descriptions α and β are syntactically sufficiently different. Such pairs of descriptions indicate that the subset has different definitions, a fact that gives useful information about the data. We give simple algorithms for this task, and evaluate their performance. The methods are based on pruning the search space of all possible pairs of formulae by different accuracy criteria. The significance of the findings is tested by using randomization methods. Experimental results on simulated and real data show that the methods work well: on simulated data they find the planted subsets, and on real data they produce small and understandable results.
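The accuracy requirement above — both implications α → β and β → α holding with high accuracy — is commonly scored as the Jaccard similarity between the support sets of the two formulae, i.e. the rows satisfying α versus the rows satisfying β. A minimal sketch of that score:

```python
# Minimal sketch of redescription accuracy as Jaccard similarity
# between the support sets of the two queries alpha and beta.

def jaccard(support_alpha, support_beta):
    a, b = set(support_alpha), set(support_beta)
    union = a | b
    # convention: two empty supports are treated as perfectly matching
    return len(a & b) / len(union) if union else 1.0
```

A Jaccard score of 1 means the two formulae describe exactly the same rows; pruning candidate pairs whose score cannot reach a threshold is what keeps the pairwise search tractable.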