
Learning classifiers from distributed, semantically heterogeneous, autonomous data sources

Source: OAI

ABSTRACT Recent advances in computing, communications, and digital storage technologies, together with the development of high-throughput data acquisition technologies, have made it possible to gather and store large volumes of data in digital form. These developments have created unprecedented opportunities for large-scale, data-driven knowledge acquisition, with the potential for fundamental gains in scientific understanding (e.g., characterization of macromolecular structure-function relationships in biology) in many data-rich domains. In such applications, the data sources of interest are typically physically distributed, semantically heterogeneous, and autonomously owned and operated, which makes it impossible to apply traditional machine learning algorithms directly. However, most learning algorithms use only certain statistics computed from the data in the process of generating their output hypothesis, and we use this observation to design a general strategy for transforming traditional algorithms for learning from data into algorithms for learning from distributed data. The resulting algorithms are provably exact, in that the classifiers they produce are identical to those obtained by the corresponding algorithms in the centralized setting (i.e., when all of the data is available in a central location), and they compare favorably to their centralized counterparts in terms of time and communication complexity. To deal with the problem of semantic heterogeneity, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between the data source ontologies and the user ontology. We show how these constraints can be used to define the mappings and conversion functions needed to answer statistical queries over semantically heterogeneous data viewed from a given user perspective. This, in turn, extends our approach for learning from distributed data into a theoretically sound approach to learning from semantically heterogeneous data. The work described above contributed to the design and implementation of AirlDM, a collection of data-source-independent machine learning algorithms built on sufficient statistics and data source wrappers, and to the design of INDUS, a federated, query-centric system for knowledge acquisition from distributed, semantically heterogeneous, autonomous data sources.

Typescript (photocopy). Thesis (Ph.D.)--Iowa State University, 2004. Includes bibliography.
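The core idea of learning from distributed data via sufficient statistics can be illustrated with a minimal sketch (hypothetical code, not from the thesis itself): each source computes local counts, the counts are aggregated by a central learner, and the aggregated statistics are identical to those that would be computed if all records sat in one place. Here the record format and function names are illustrative assumptions.

```python
from collections import Counter

def local_counts(records):
    # Sufficient statistics for a count-based learner (e.g., Naive Bayes),
    # computed entirely at one data source: class counts and
    # (class, feature index, feature value) counts.
    class_counts = Counter()
    feature_counts = Counter()
    for features, label in records:
        class_counts[label] += 1
        for i, v in enumerate(features):
            feature_counts[(label, i, v)] += 1
    return class_counts, feature_counts

def merge(stats_list):
    # Aggregate the statistics from all sources; only counts travel over
    # the network, never the raw records.
    total_class, total_feat = Counter(), Counter()
    for cc, fc in stats_list:
        total_class.update(cc)
        total_feat.update(fc)
    return total_class, total_feat

# Two hypothetical distributed sources holding (features, label) records.
src_a = [((0, 1), "yes"), ((1, 1), "no")]
src_b = [((0, 0), "yes")]

cc, fc = merge([local_counts(src_a), local_counts(src_b)])

# Exactness: the merged statistics equal the centralized statistics,
# so any classifier built from them is identical to the centralized one.
central_cc, central_fc = local_counts(src_a + src_b)
assert cc == central_cc and fc == central_fc
```

Because `merge` is a simple sum of counts, the communication cost is proportional to the number of distinct statistics rather than the number of records, which is the source of the favorable communication complexity claimed above.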

    ABSTRACT: Incremental learning of neural networks has attracted much interest in recent years due to its wide applicability to large-scale data sets and to distributed learning scenarios. Moreover, nonstationary learning paradigms have emerged as a subarea of study in the machine learning literature because of the problems classical methods face when dealing with data set shifts. In this paper we present an algorithm to train single-layer neural networks with nonlinear output functions that takes into account incremental, nonstationary, and distributed learning scenarios. Moreover, it is demonstrated that introducing a regularization term into the proposed model is equivalent to choosing a particular initialization for the devised training algorithm, which may be suitable for real-time systems that have to work under noisy conditions. In addition, the algorithm includes some previous models as special cases and can be used as a building block for more complex models, such as multilayer perceptrons, extending the capacity of these models to incremental, nonstationary, and distributed learning paradigms. The proposed algorithm is tested on standard data sets and compared with previous approaches, demonstrating its higher accuracy.
    Pattern Recognition 12/2012; 45(12):4536-4546. DOI:10.1016/j.patcog.2012.05.009 · 2.58 Impact Factor
    ABSTRACT: This paper discusses the issues related to the process of global decision-making on the basis of knowledge that is stored in a dispersed form (several local knowledge bases or classifiers). The paper describes a decision-making system in which the classification of a test object proceeds in several steps. In the first step, we investigate how the individual classifiers classify the test object, describing their outputs as probability vectors over the decision classes. We then cluster the classifiers according to the similarity of these probability vectors and, for every cluster, derive a kind of combined information. Finally, we classify the test object by voting among the clusters, using the combined information from each of them. The paper proposes a new approach to organizing the structure of a decision-making system that operates on the basis of dispersed knowledge. In the presented system, the classifiers are combined into groups, called clusters, in a dynamic way: we seek groups of classifiers that classify the test object in a similar manner. The groups of classifiers are not disjoint sets; we use overlapping clusters because this is a more suitable representation of classification compatibility. It is assumed that if a classifier classifies the test object ambiguously, it should belong to several clusters. A process of eliminating inconsistencies in the knowledge is then applied within the created groups, and global decisions are made using one of the methods for the analysis of conflicts.
    Information Sciences 06/2014; 270:172–191. DOI:10.1016/j.ins.2014.02.076 · 3.89 Impact Factor
    ABSTRACT: The issues related to the process of global decision-making on the basis of knowledge that is stored in a dispersed form (several local knowledge bases or classifiers) are discussed in this paper. In the decision-making system described here, classification of a test object starts with an investigation of how the individual classifiers classify it; we describe the views of the classifiers by probability vectors over the decision classes. In this system, the process of combining classifiers into coalitions is central, and negotiation is used in the clustering process. We define three types of relations between classifiers: friendship, conflict, and neutrality. The clustering process consists of two stages: in the first, initial groups are created, containing classifiers that are in a friendship relation; in the second, classifiers that are in a neutrality relation are attached to the existing groups. The paper presents a formal description of the clustering process and the mathematical properties of the functions used. For every cluster, we find a kind of combined information, and we then classify the given test object by voting among the clusters, using the combined information from each of them. The new way of creating clusters (with a negotiation stage) is compared to the approach presented in Przybyła-Kasperek and Wakulicz-Deja (2014) [23] (without negotiation). As shown in the paper, there are significant differences between the clusters generated by these two approaches; in the new approach, the clusters are more complex and better reconstruct and illustrate the views of the classifiers on the classification.
    Information Sciences 12/2014; 288:194–219. DOI:10.1016/j.ins.2014.07.032 · 3.89 Impact Factor
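The cluster-then-vote scheme described in the two abstracts above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the cited papers' actual method: classifiers are grouped greedily by cosine similarity of their probability vectors (clusters may overlap, since a classifier is tested against every existing cluster), each cluster's vectors are averaged into combined information, and the final decision is a vote among clusters. The threshold value and the use of the first member as a cluster representative are illustrative choices.

```python
import numpy as np

def cluster_by_similarity(prob_vectors, threshold=0.95):
    # Greedily group classifiers whose probability vectors over the decision
    # classes are similar (cosine similarity above threshold). A classifier
    # that matches several clusters joins all of them, so clusters overlap.
    clusters = []
    for i, v in enumerate(prob_vectors):
        placed = False
        for cluster in clusters:
            rep = prob_vectors[cluster[0]]  # first member as representative
            sim = np.dot(v, rep) / (np.linalg.norm(v) * np.linalg.norm(rep))
            if sim >= threshold:
                cluster.append(i)
                placed = True
        if not placed:
            clusters.append([i])
    return clusters

def vote(prob_vectors, clusters):
    # Combine each cluster's vectors by averaging ("combined information"),
    # then let the clusters vote with their most probable class.
    combined = [np.mean([prob_vectors[i] for i in c], axis=0) for c in clusters]
    votes = np.bincount([int(np.argmax(v)) for v in combined])
    return int(np.argmax(votes))

# Three classifiers over two decision classes: two agree, one dissents.
probs = [np.array([0.9, 0.1]), np.array([0.8, 0.2]), np.array([0.2, 0.8])]
clusters = cluster_by_similarity(probs)   # [[0, 1], [2]]
decision = vote(probs, clusters)          # class 0 (ties broken low)
```

Replacing the similarity test with the friendship/conflict/neutrality negotiation of the second paper would change only `cluster_by_similarity`; the combine-and-vote stage stays the same.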
