Learning classifiers from distributed, semantically heterogeneous, autonomous data sources

Source: OAI


Recent advances in computing, communications, and digital storage technologies, together with the development of high-throughput data acquisition technologies, have made it possible to gather and store large volumes of data in digital form. These developments have resulted in unprecedented opportunities for large-scale data-driven knowledge acquisition with the potential for fundamental gains in scientific understanding (e.g., characterization of macromolecular structure-function relationships in biology) in many data-rich domains. In such applications, the data sources of interest are typically physically distributed, semantically heterogeneous, and autonomously owned and operated, which makes it impossible to use traditional machine learning algorithms for knowledge acquisition. However, we observe that most learning algorithms use only certain statistics computed from the data in the process of generating the hypothesis that they output, and we use this observation to design a general strategy for transforming traditional algorithms for learning from data into algorithms for learning from distributed data. The resulting algorithms are provably exact, in that the classifiers they produce are identical to those obtained by the corresponding algorithms in the centralized setting (i.e., when all of the data is available in a central location), and they compare favorably to their centralized counterparts in terms of time and communication complexity. To deal with the semantic heterogeneity problem, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between the data source ontologies and the user ontology. We show how these constraints can be used to define the mappings and conversion functions needed to answer statistical queries from semantically heterogeneous data viewed from a certain user perspective. This is further used to extend our approach for learning from distributed data into a theoretically sound approach to learning from semantically heterogeneous data. The work described above contributed to the design and implementation of AirlDM, a collection of machine learning algorithms made data source independent by means of sufficient statistics and data source wrappers, and to the design of INDUS, a federated, query-centric system for knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources.

Typescript (photocopy). Thesis (Ph.D.)--Iowa State University, 2004. Includes bibliography.
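The sufficient-statistics strategy described in the abstract can be illustrated with a minimal sketch for Naive Bayes under horizontal fragmentation (each site holds a subset of the rows). This is not the thesis's AirlDM code; the function names `local_counts`, `combine`, and `predict`, the tuple-based row layout (attribute values followed by a class label), and the use of Laplace smoothing are all assumptions made for illustration. The point it shows is that each site only needs to ship counts, and the summed counts equal the counts that would be computed on the pooled data, so the resulting classifier matches the centralized one.

```python
import math
from collections import Counter


def local_counts(rows):
    """Compute Naive Bayes sufficient statistics at one site.

    Each row is a tuple (attr_0, ..., attr_k, class_label); this layout is
    an assumption of the sketch. Returns class counts and joint counts
    keyed by (attribute index, attribute value, class label).
    """
    class_counts = Counter()
    joint_counts = Counter()
    for *attrs, label in rows:
        class_counts[label] += 1
        for i, value in enumerate(attrs):
            joint_counts[(i, value, label)] += 1
    return class_counts, joint_counts


def combine(site_stats):
    """Add up the counts returned by every site (horizontal fragmentation)."""
    total_class, total_joint = Counter(), Counter()
    for class_counts, joint_counts in site_stats:
        total_class.update(class_counts)   # Counter.update adds counts
        total_joint.update(joint_counts)
    return total_class, total_joint


def predict(class_counts, joint_counts, attrs, n_values):
    """Classify one instance from the combined counts.

    n_values[i] is the number of distinct values of attribute i, used for
    Laplace smoothing (a choice of this sketch, not of the source).
    """
    n = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n)
        for i, value in enumerate(attrs):
            score += math.log(
                (joint_counts[(i, value, label)] + 1) / (count + n_values[i])
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In this sketch the raw rows never leave a site; only the two counters do, so the communication cost depends on the number of distinct (attribute, value, class) combinations rather than on the number of rows.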



Available from: Doina Caragea, Jul 28, 2015
  • Source
    • "Energy, time, cost, security and privacy concerns prevent however the full centralization of data in a single computation node, as required by many traditional data mining algorithms. Methods providing the same or almost the same performance of such centralized solution but dealing with the domain constraints like synchronization, communication overhead and scalability are hence of primary interest in practice and have oriented recent research efforts, ranging from general frameworks [32] [3] [5] to model specific methods [14] [27] [2]."
    ABSTRACT: We explore a technique to learn Support Vector Models (SVMs) when training data is partitioned among several data sources. The basic idea is to consider SVMs which can be reduced to Minimal Enclosing Ball (MEB) problems in a feature space. Computation of such SVMs can be efficiently achieved by finding a core-set for the image of the data in the feature space. Our main result is that the union of local core-sets provides a close approximation to a global core-set from which the SVM can be recovered. The method hence requires a single pass through each source of data in order to compute local core-sets and then to recover the SVM from their union. Extensive simulations on small and large datasets are presented in order to evaluate its classification accuracy, transmission efficiency and global complexity, comparing its results with a widely used single-pass heuristic to learn standard SVMs.
    Proceedings of the SIAM International Conference on Data Mining, SDM 2010, April 29 - May 1, 2010, Columbus, Ohio, USA; 01/2010
  • Source
    • "Provost et al [10] survey work on scaling up learning algorithms. Examples of approaches that have been explored include parallelization of specific algorithms [6], support for disk resident data [1], and learning decision trees from statistical queries [9] [3] [2]. WekaDB [13] enables WEKA implementations of learning algorithms to be used with data that reside in a relational database."
    ABSTRACT: We describe an approach to learning predictive models from large databases in settings where direct access to data is not available because of massive size of data, access restrictions, or bandwidth requirements. We outline some techniques for minimizing the number of statistical queries needed, and for efficiently coping with missing values in the data. We provide open source implementations of the decision tree and Naive Bayes algorithms that demonstrate the feasibility of the proposed approach.
    2008 IEEE / WIC / ACM International Conference on Web Intelligence, WI 2008, 9-12 December 2008, Sydney, NSW, Australia, Main Conference Proceedings; 01/2008
  • Source
    • "More precisely, we have applied the general framework described above for construction of algorithms for learning classifiers from distributed data to design provably exact algorithms for learning Naïve Bayes, Nearest Neighbor, Bayes Network, Neural Network, and Decision Tree classifiers from distributed data under horizontal and vertical data fragmentation [Caragea, 2004a], and Support Vector Machine (SVM) classifiers under horizontal data fragmentation (at the expense of multiple passes through the distributed data) [Caragea et al., 2004; Honavar & Caragea, 2008]. We have also established the precise conditions under which the proposed algorithms offer significant savings in bandwidth, memory, and/or computation time (relative to their centralized counterparts) [Caragea, 2004; Honavar & Caragea, 2008]. "
    ABSTRACT: We summarize progress on algorithms and software for knowledge acquisition from large, distributed, autonomous, and semantically disparate information sources. Some key results include: scalable algorithms for constructing predictive models from data based on a novel decomposition of learning algorithms that interleaves queries for sufficient statistics from data with computations using the statistics; provably exact algorithms for learning from distributed data (relative to their centralized counterparts); and statistically sound approaches to learning predictive models from partially specified data that arise in settings where the schema and the data semantics, and hence the granularity of data, differ across the different sources.
    Semantic Scientific Knowledge Integration, Papers from the 2008 AAAI Spring Symposium, Technical Report SS-08-05, Stanford, California, USA, March 26-28, 2008; 01/2008