Learning classifiers from distributed, semantically heterogeneous, autonomous data sources

by

Doina Caragea

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Computer Science

Program of Study Committee:

Vasant Honavar, Major Professor

Dianne Cook

Drena Dobbs

David Fernandez-Baca

Leslie Miller

Iowa State University

Ames, Iowa

2004

Copyright © Doina Caragea, 2004. All rights reserved.

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Traditional Machine Learning Limitations . . . . . . . . . . . . . . . . . . . . 4

1.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4.1 Distributed Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

1.4.2 Information Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

1.4.3 Learning Classifiers from Heterogeneous Data . . . . . . . . . . . . . . 18

1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

2 LEARNING CLASSIFIERS FROM DATA . . . . . . . . . . . . . . . . . 23

2.1 Machine Learning Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

2.2 Learning from Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

2.3 Examples of Algorithms for Learning from Data . . . . . . . . . . . . . . . . . 27

2.3.1 Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.3.2 Decision Tree Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3.3 Perceptron Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

2.3.4 Support Vector Machines and Related Large Margin Classifiers . . . . . 33

2.3.5 k Nearest Neighbors Classifiers . . . . . . . . . . . . . . . . . . . . . . 40

2.4 Decomposition of Learning Algorithms into Information Extraction and Hypothesis Generation Components . . . . . . . . . . . . . . . . . . . . . . . . . 42

2.5 Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

2.6 Examples of Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 47

2.6.1 Sufficient Statistics for Naive Bayes Classifiers . . . . . . . . . . . . . . 47

2.6.2 Sufficient Statistics for Decision Trees . . . . . . . . . . . . . . . . . . . 47

2.6.3 Sufficient Statistics for Perceptron Algorithm . . . . . . . . . . . . . . . 48

2.6.4 Sufficient Statistics for SVM . . . . . . . . . . . . . . . . . . . . . . . . 50

2.6.5 Sufficient Statistics for k-NN . . . . . . . . . . . . . . . . . . . . . . . . 51

2.7 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3 LEARNING CLASSIFIERS FROM DISTRIBUTED DATA . . . . . . . 54

3.1 Learning from Distributed Data . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.2 General Strategy for Learning from Distributed Data . . . . . . . . . . . . . . 58

3.3 Algorithms for Learning Classifiers from Distributed Data . . . . . . . . . . . 60

3.3.1 Learning Naive Bayes Classifiers from Distributed Data . . . . . . . . . 61

3.3.2 Learning Decision Tree Classifiers from Distributed Data . . . . . . . . 68

3.3.3 Horizontally Fragmented Distributed Data . . . . . . . . . . . . . . . . 68

3.3.4 Learning Threshold Functions from Distributed Data . . . . . . . . . . 78

3.3.5 Learning Support Vector Machines from Distributed Data . . . . . . . 84

3.3.6 Learning k Nearest Neighbor Classifiers from Distributed Data . . . . . 92

3.4 Statistical Query Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

3.4.1 Operator Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

3.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4 LEARNING CLASSIFIERS FROM SEMANTICALLY HETEROGENEOUS DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

4.1 Integration of the Data at the Semantic Level . . . . . . . . . . . . . . . . . . 107

4.1.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

4.1.2 Ontology Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.1.3 Ontology Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

4.1.4 Ontology-Extended Data Sources . . . . . . . . . . . . . . . . . . . . . 119

4.2 Ontology-Extended Query Operators . . . . . . . . . . . . . . . . . . . . . . . 123

4.2.1 Ontology-Extended Primitive Operators . . . . . . . . . . . . . . . . . 124

4.2.2 Ontology-Extended Statistical Operators . . . . . . . . . . . . . . . . . 126

4.3 Semantic Heterogeneity and Statistical Queries . . . . . . . . . . . . . . . . . . 127

4.4 Algorithms for Learning Classifiers from Heterogeneous Distributed Data . . . 129

4.4.1 Naive Bayes Classifiers from Heterogeneous Data . . . . . . . . . . . . 132

4.4.2 Decision Tree Induction from Heterogeneous Data . . . . . . . . . . . . 133

4.4.3 Support Vector Machines from Heterogeneous Data . . . . . . . . . . . 133

4.4.4 Learning Threshold Functions from Heterogeneous Data . . . . . . . . 135

4.4.5 k-Nearest Neighbors Classifiers from Heterogeneous Data . . . . . . . . 135

4.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

5 SUFFICIENT STATISTICS GATHERING . . . . . . . . . . . . . . . . . 139

5.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

5.2 Central Resource Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

5.3 Query Answering Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

5.4 Query Optimization Component . . . . . . . . . . . . . . . . . . . . . . . . . . 146

5.4.1 Optimization Problem Definition . . . . . . . . . . . . . . . . . . . . . 146

5.4.2 Planning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

5.5 Sufficient Statistics Gathering: Example . . . . . . . . . . . . . . . . . . . . . 151

5.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

6 INDUS: A FEDERATED QUERY-CENTRIC APPROACH TO LEARNING CLASSIFIERS FROM DISTRIBUTED HETEROGENEOUS AUTONOMOUS DATA SOURCES . . . . . . . . . . . . . . 156

6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156

6.2 From Weka to AirlDM to INDUS . . . . . . . . . . . . . . . . . . . . . . . . . 158

6.3 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161