Learning classifiers from distributed, semantically heterogeneous, autonomous data sources

Source: OAI

ABSTRACT Recent advances in computing, communications, and digital storage technologies, together with the development of high-throughput data acquisition technologies, have made it possible to gather and store large volumes of data in digital form. These developments have resulted in unprecedented opportunities for large-scale, data-driven knowledge acquisition, with the potential for fundamental gains in scientific understanding (e.g., characterization of macromolecular structure-function relationships in biology) in many data-rich domains. In such applications, the data sources of interest are typically physically distributed, semantically heterogeneous, and autonomously owned and operated, which makes it impossible to use traditional machine learning algorithms for knowledge acquisition. However, we observe that most learning algorithms use only certain statistics computed from the data in the process of generating the hypothesis that they output, and we use this observation to design a general strategy for transforming traditional algorithms for learning from data into algorithms for learning from distributed data. The resulting algorithms are provably exact, in that the classifiers they produce are identical to those obtained by the corresponding algorithms in the centralized setting (i.e., when all of the data is available in a central location), and they compare favorably to their centralized counterparts in terms of time and communication complexity. To deal with the semantic heterogeneity problem, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between the data source ontologies and the user ontology. We show how these constraints can be used to define the mappings and conversion functions needed to answer statistical queries from semantically heterogeneous data viewed from a certain user perspective.
This is then used to extend our approach for learning from distributed data into a theoretically sound approach to learning from semantically heterogeneous data. The work described above contributed to the design and implementation of AirlDM, a collection of machine learning algorithms that are made data source independent by means of sufficient statistics and data source wrappers, and to the design of INDUS, a federated, query-centric system for knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources.
Typescript (photocopy). Thesis (Ph.D.)--Iowa State University, 2004. Includes bibliography.
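The sufficient-statistics idea in the abstract can be illustrated with a small sketch. It is not taken from the thesis or from AirlDM: it assumes a Naive Bayes classifier over categorical features, data that is horizontally partitioned across sites, and hypothetical helper names (`local_counts`, `merge`, `predict`). Each site returns only counts, and the merged counts equal the counts computed over the pooled data, so the resulting classifier matches the centralized one.

```python
import math
from collections import Counter

def local_counts(rows):
    """Compute Naive Bayes sufficient statistics at one site (hypothetical helper)."""
    class_counts = Counter()
    feature_counts = Counter()
    for features, label in rows:
        class_counts[label] += 1
        for i, value in enumerate(features):
            feature_counts[(i, value, label)] += 1
    return class_counts, feature_counts

def merge(per_site_stats):
    """Add up the counts returned by the sites; no raw rows are exchanged."""
    class_counts, feature_counts = Counter(), Counter()
    for site_classes, site_features in per_site_stats:
        class_counts.update(site_classes)
        feature_counts.update(site_features)
    return class_counts, feature_counts

def predict(stats, features, n_values):
    """Pick the most probable class using Laplace-smoothed estimates.

    n_values[i] is the number of distinct values of feature i (assumed known).
    """
    class_counts, feature_counts = stats
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        count = class_counts[label]
        score = math.log(count / total)
        for i, value in enumerate(features):
            smoothed = (feature_counts[(i, value, label)] + 1) / (count + n_values[i])
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data split across two sites (illustrative values only).
site_a = [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")]
site_b = [(("sunny", "mild"), "yes"), (("rainy", "hot"), "no"), (("sunny", "hot"), "no")]

distributed = merge([local_counts(site_a), local_counts(site_b)])
centralized = local_counts(site_a + site_b)
query = ("sunny", "hot")
n_values = [2, 2]
```

In this sketch the communication cost depends on the number of distinct (feature, value, class) combinations rather than on the number of rows held at each site.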

  • ABSTRACT: This work deals with distributed machine learning. Distributed learning from data is considered an important challenge faced by researchers and practitioners in the domain of distributed data mining and distributed knowledge discovery from databases. Currently, learning from data is recognized as one of the most widely investigated paradigms of machine learning. At the same time, it is perceived as a difficult and demanding computational problem. Learning from distributed data is even more complex and still, to a large extent, an open problem. One approach suitable for learning from geographically distributed data is to select relevant local patterns, also called prototypes, from the local databases. Such prototypes are selected using specialized data reduction methods. The dissertation contains an overview of the problem of learning classifiers from data, followed by a discussion of distributed learning, including the problem formulation and a state-of-the-art review. Next, data reduction approaches, techniques, and algorithms are discussed. The central part of the dissertation proposes an agent-based distributed learning framework. The idea is to carry out data reduction in parallel at separate locations, employing specialized software agents. The process ends when the locally selected prototypes are moved to a central site and merged into the global knowledge model. The following part of the work contains the results of an extensive computational experiment aimed at validating the proposed approach. Finally, conclusions and suggestions for further research are formulated.
    T. Computational Collective Intelligence. 01/2011; 4:3-121.
  • ABSTRACT: This paper proposes a new scalable and robust distributed algorithm for constructing distributed decision trees in a peer-to-peer environment with heterogeneous data sources. Computation and communication costs are higher in a peer-to-peer environment, and the chances of reduced accuracy and degraded response time may also be higher. The proposed algorithm scales well and also provides the best prediction model for the well-known classification technique of distributed decision trees.
    International Journal on Computer Science and Engineering. 01/2011;
  • ABSTRACT: We explore a technique to learn Support Vector Models (SVMs) when training data is partitioned among several data sources. The basic idea is to consider SVMs which can be reduced to Minimal Enclosing Ball (MEB) problems in a feature space. Computation of such SVMs can be efficiently achieved by finding a core-set for the image of the data in the feature space. Our main result is that the union of local core-sets provides a close approximation to a global core-set from which the SVM can be recovered. The method hence requires a single pass through each source of data in order to compute local core-sets and then to recover the SVM from their union. Extensive simulations on small and large datasets are presented in order to evaluate its classification accuracy, transmission efficiency, and global complexity, comparing its results with a widely used single-pass heuristic to learn standard SVMs.
    Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 - May 1, 2010, Columbus, Ohio, USA; 01/2010

