Conference Paper

Multiple Information Sources Cooperative Learning.

Conference: IJCAI 2009, Proceedings of the 21st International Joint Conference on Artificial Intelligence, Pasadena, California, USA, July 11-17, 2009
Source: DBLP

ABSTRACT Many applications are facing the problem of learning from an objective dataset, whereas information from other auxiliary sources may be beneficial but cannot be integrated into the objective dataset for learning. In this paper, we propose an omni-view learning approach to enable learning from multiple data collections. The theme is to organize heterogeneous data sources into a unified table with global data view. To achieve the omni-view learning goal, we consider that the objective dataset and the auxiliary datasets share some instance-level dependency structures. We then propose a relational k-means to cluster instances in each auxiliary dataset, such that clusters can help build new features to capture correlations between the objective and auxiliary datasets. Experimental results demonstrate that omni-view learning can help build models which outperform the ones learned from the objective dataset only. Comparisons with the co-training algorithm further assert that omni-view learning provides an alternative, yet effective, way for semi-supervised learning.
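
The idea of clustering an auxiliary dataset and turning the clusters into new features for the objective dataset can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the paper's algorithm: plain Lloyd's k-means stands in for the relational k-means, and the binary `links` matrix is a hypothetical encoding of the instance-level dependencies between objective and auxiliary instances. The names `kmeans`, `cluster_features`, `X_obj`, `X_aux`, and `links` are all illustrative.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means (a stand-in, NOT the paper's relational k-means)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each auxiliary instance to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; an empty cluster keeps its previous center.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def cluster_features(links, aux_labels, k):
    """Build k new features per objective instance.

    links[i, j] = 1 means objective instance i depends on auxiliary
    instance j (hypothetical encoding). Feature c is the fraction of
    instance i's linked auxiliary instances that fall in cluster c.
    """
    links = np.asarray(links, dtype=float)
    onehot = np.eye(k)[aux_labels]            # shape (n_aux, k)
    counts = links @ onehot                   # shape (n_obj, k)
    totals = counts.sum(axis=1, keepdims=True)
    out = np.zeros_like(counts)
    # Objective instances with no links get all-zero features.
    return np.divide(counts, totals, out=out, where=totals > 0)

# Toy data: three objective instances, four auxiliary instances.
X_obj = np.array([[0.1, 1.0], [0.9, 0.2], [0.5, 0.5]])
X_aux = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
links = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [1, 0, 1, 0]])

aux_labels = kmeans(X_aux, k=2)
# "Unified table": original objective features plus cluster-derived ones.
omni = np.hstack([X_obj, cluster_features(links, aux_labels, k=2)])
print(omni.shape)
```

Any standard classifier could then be trained on `omni` in place of `X_obj`; how the original work weights or constructs these features may differ.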

  • ABSTRACT: The purpose of data mining from distributed information systems is usually threefold: (1) identifying locally significant patterns in individual databases; (2) discovering emerging significant patterns after unifying distributed databases in a single view; and (3) finding patterns which follow special relationships across different data collections. While existing research has significantly advanced the techniques for mining local and global patterns (the first two goals), few attempts have been made to discover patterns across distributed databases (the third goal). Moreover, no framework currently exists to support the mining of all three types of patterns. This paper proposes solutions to discover patterns from distributed databases. More specifically, we consider pattern mining as a query process where the purpose is to discover patterns from distributed databases whose relationships satisfy user-specified query constraints. We argue that existing self-contained mining frameworks are neither efficient nor feasible for this objective, mainly because their pattern pruning is single-database oriented. To solve the problem, we advocate a cross-database pruning concept and propose a collaborative pattern (CLAP) mining framework with cross-database pruning mechanisms for distributed pattern mining. In CLAP, distributed databases collaboratively exchange pattern information between sites so that each site can leverage information from other sites to gain cross-database pruning. Experimental results show that CLAP fits a niche position, and demonstrate that CLAP not only outperforms its peers with significant runtime performance gains, but also helps find patterns that other methods cannot discover.
    Decision Support Systems. 01/2011; 52:40-51.
  • ABSTRACT: Multiple information sources for the same set of objects can provide different representations, and combining their advantages may improve the predictive power for a given task. However, some sources might be irrelevant or redundant. Thus, it is meaningful to select a set of good information sources that could help improve the learning performance, yet very little work has been reported on this topic. In this paper, we first identify two aspects of information quality: source significance and source redundancy. Significance represents the degree to which an information source contributes to the classification, and redundancy reflects the information overlap among different information sources. We then propose a metric that combines neighborhood mutual information with a Max-Significance–Min-Redundancy algorithm, allowing us to select a compact set of superior information sources for classification learning. Extensive experiments show that the metric is very helpful in finding good information sources, and that the proposed method outperforms many other methods.
    Neurocomputing 01/2014; 133:95–102. · 1.63 Impact Factor
  • ABSTRACT: Data stream classification has drawn increasing attention from the data mining community in recent years. Relevant applications include network traffic monitoring, sensor network data analysis, Web click stream mining, power consumption measurement, and dynamic tracing of stock fluctuations, to name a few. Data stream classification in such real-world applications is typically subject to three major challenges: concept drifting, large volumes, and partial labeling. As a result, training examples in data streams can be very diverse, and it is very hard to learn accurate models efficiently. In this paper, we propose a novel framework that first categorizes diverse training examples into four types and assigns learning priorities to them. Then, we derive four learning cases based on the proportion and priority of the different types of training examples. Finally, for each learning case, we employ one of four SVM-based training models: classical SVM, semi-supervised SVM, transfer semi-supervised SVM, and relational k-means transfer semi-supervised SVM. We perform comprehensive experiments on real-world data streams that validate the utility of our approach.
    Neurocomputing. 09/2012; 92:170–182.

