Learning classifiers from distributed, semantically heterogeneous, autonomous data sources

by

Doina Caragea

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Computer Science

Program of Study Committee:

Vasant Honavar, Major Professor

Dianne Cook

Drena Dobbs

David Fernandez-Baca

Leslie Miller

Iowa State University

Ames, Iowa

2004

Copyright © Doina Caragea, 2004. All rights reserved.

Graduate College

Iowa State University

This is to certify that the doctoral dissertation of

Doina Caragea

has met the dissertation requirements of Iowa State University

Major Professor

For the Major Program

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ACKNOWLEDGMENTS

ABSTRACT

1 INTRODUCTION

1.1 Motivation

1.2 Traditional Machine Learning Limitations

1.3 Our Approach

1.4 Literature Review

1.4.1 Distributed Learning

1.4.2 Information Integration

1.4.3 Learning Classifiers from Heterogeneous Data

1.5 Outline

2 LEARNING CLASSIFIERS FROM DATA

2.1 Machine Learning Systems

2.2 Learning from Data

2.3 Examples of Algorithms for Learning from Data

2.3.1 Naive Bayes Classifiers

2.3.2 Decision Tree Algorithm

2.3.3 Perceptron Algorithm

2.3.4 Support Vector Machines and Related Large Margin Classifiers

2.3.5 k Nearest Neighbors Classifiers

2.4 Decomposition of Learning Algorithms into Information Extraction and Hypothesis Generation Components

2.5 Sufficient Statistics

2.6 Examples of Sufficient Statistics

2.6.1 Sufficient Statistics for Naive Bayes Classifiers

2.6.2 Sufficient Statistics for Decision Trees

2.6.3 Sufficient Statistics for Perceptron Algorithm

2.6.4 Sufficient Statistics for SVM

2.6.5 Sufficient Statistics for k-NN

2.7 Summary and Discussion

3 LEARNING CLASSIFIERS FROM DISTRIBUTED DATA

3.1 Learning from Distributed Data

3.2 General Strategy for Learning from Distributed Data

3.3 Algorithms for Learning Classifiers from Distributed Data

3.3.1 Learning Naive Bayes Classifiers from Distributed Data

3.3.2 Learning Decision Tree Classifiers from Distributed Data

3.3.3 Horizontally Fragmented Distributed Data

3.3.4 Learning Threshold Functions from Distributed Data

3.3.5 Learning Support Vector Machines from Distributed Data

3.3.6 Learning k Nearest Neighbor Classifiers from Distributed Data

3.4 Statistical Query Language

3.4.1 Operator Definitions

3.5 Summary and Discussion

4 LEARNING CLASSIFIERS FROM SEMANTICALLY HETEROGENEOUS DATA

4.1 Integration of the Data at the Semantic Level

4.1.1 Motivating Example

4.1.2 Ontology Definition

4.1.3 Ontology Integration

4.1.4 Ontology-Extended Data Sources

4.2 Ontology-Extended Query Operators

4.2.1 Ontology-Extended Primitive Operators

4.2.2 Ontology-Extended Statistical Operators

4.3 Semantic Heterogeneity and Statistical Queries

4.4 Algorithms for Learning Classifiers from Heterogeneous Distributed Data

4.4.1 Naive Bayes Classifiers from Heterogeneous Data

4.4.2 Decision Tree Induction from Heterogeneous Data

4.4.3 Support Vector Machines from Heterogeneous Data

4.4.4 Learning Threshold Functions from Heterogeneous Data

4.4.5 k-Nearest Neighbors Classifiers from Heterogeneous Data

4.5 Summary and Discussion

5 SUFFICIENT STATISTICS GATHERING

5.1 System Architecture

5.2 Central Resource Repository

5.3 Query Answering Engine

5.4 Query Optimization Component

5.4.1 Optimization Problem Definition

5.4.2 Planning Algorithm

5.5 Sufficient Statistics Gathering: Example

5.6 Summary and Discussion

6 INDUS: A FEDERATED QUERY-CENTRIC APPROACH TO LEARNING CLASSIFIERS FROM DISTRIBUTED HETEROGENEOUS AUTONOMOUS DATA SOURCES

6.1 Overview

6.2 From Weka to AirlDM to INDUS

6.3 Case Study