Learning classifiers from distributed, semantically heterogeneous, autonomous
data sources
by
Doina Caragea
A dissertation submitted to the graduate faculty
in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Major: Computer Science
Program of Study Committee:
Vasant Honavar, Major Professor
Dianne Cook
Drena Dobbs
David Fernandez-Baca
Leslie Miller
Iowa State University
Ames, Iowa
2004
Copyright © Doina Caragea, 2004. All rights reserved.
Graduate College
Iowa State University
This is to certify that the doctoral dissertation of
Doina Caragea
has met the dissertation requirements of Iowa State University
Major Professor
For the Major Program
TABLE OF CONTENTS
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
LIST OF FIGURES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xiv
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .xvi
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1
1.2 Traditional Machine Learning Limitations . . . . . . . . . . . . . . . . . . . 4
1.3 Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .5
1.4 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8
1.4.1 Distributed Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . .9
1.4.2 Information Integration . . . . . . . . . . . . . . . . . . . . . . . . . . .16
1.4.3 Learning Classifiers from Heterogeneous Data . . . . . . . . . . . . . .18
1.5 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 LEARNING CLASSIFIERS FROM DATA. . . . . . . . . . . . . . . . . 23
2.1 Machine Learning Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . .23
2.2 Learning from Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.3 Examples of Algorithms for Learning from Data . . . . . . . . . . . . . . . . .27
2.3.1 Naive Bayes Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . .27
2.3.2 Decision Tree Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . .30
2.3.3 Perceptron Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.4 Support Vector Machines and Related Large Margin Classifiers . . . . . 33
2.3.5 k Nearest Neighbors Classifiers . . . . . . . . . . . . . . . . . . . . . .40
2.4 Decomposition of Learning Algorithms into Information Extraction and Hy-
pothesis Generation Components . . . . . . . . . . . . . . . . . . . . . . . . .42
2.5 Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43
2.6 Examples of Sufficient Statistics . . . . . . . . . . . . . . . . . . . . . . . . . .47
2.6.1 Sufficient Statistics for Naive Bayes Classifiers . . . . . . . . . . . . . .47
2.6.2 Sufficient Statistics for Decision Trees . . . . . . . . . . . . . . . . . . .47
2.6.3 Sufficient Statistics for Perceptron Algorithm . . . . . . . . . . . . . . . 48
2.6.4 Sufficient Statistics for SVM . . . . . . . . . . . . . . . . . . . . . . . .50
2.6.5 Sufficient Statistics for k-NN . . . . . . . . . . . . . . . . . . . . . . . .51
2.7 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51
3 LEARNING CLASSIFIERS FROM DISTRIBUTED DATA . . . . . . .54
3.1 Learning from Distributed Data . . . . . . . . . . . . . . . . . . . . . . . . . .54
3.2 General Strategy for Learning from Distributed Data . . . . . . . . . . . . . .58
3.3 Algorithms for Learning Classifiers from Distributed Data . . . . . . . . . . .60
3.3.1 Learning Naive Bayes Classifiers from Distributed Data . . . . . . . . .61
3.3.2 Learning Decision Tree Classifiers from Distributed Data . . . . . . . . 68
3.3.3 Horizontally Fragmented Distributed Data . . . . . . . . . . . . . . . .68
3.3.4 Learning Threshold Functions from Distributed Data . . . . . . . . . .78
3.3.5 Learning Support Vector Machines from Distributed Data. . . . . . .84
3.3.6 Learning k Nearest Neighbor Classifiers from Distributed Data . . . . .92
3.4 Statistical Query Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . .99
3.4.1 Operator Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
3.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4 LEARNING CLASSIFIERS FROM SEMANTICALLY HETERO-
GENEOUS DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .106
4.1 Integration of the Data at the Semantic Level . . . . . . . . . . . . . . . . . 107
4.1.1 Motivating Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.1.2 Ontology Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.1.3 Ontology Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.1.4 Ontology-Extended Data Sources . . . . . . . . . . . . . . . . . . . . . 119
4.2 Ontology-Extended Query Operators . . . . . . . . . . . . . . . . . . . . . . . 123
4.2.1 Ontology-Extended Primitive Operators . . . . . . . . . . . . . . . . . 124
4.2.2 Ontology-Extended Statistical Operators . . . . . . . . . . . . . . . . . 126
4.3 Semantic Heterogeneity and Statistical Queries . . . . . . . . . . . . . . . . . . 127
4.4 Algorithms for Learning Classifiers from Heterogeneous Distributed Data . . . 129
4.4.1 Naive Bayes Classifiers from Heterogeneous Data . . . . . . . . . . . . 132
4.4.2 Decision Tree Induction from Heterogeneous Data . . . . . . . . . . . . 133
4.4.3 Support Vector Machines from Heterogeneous Data . . . . . . . . . . . 133
4.4.4 Learning Threshold Functions from Heterogeneous Data . . . . . . . . 135
4.4.5 k-Nearest Neighbors Classifiers from Heterogeneous Data . . . . . . . . 135
4.5 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5 SUFFICIENT STATISTICS GATHERING . . . . . . . . . . . . . . . . .139
5.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.2 Central Resource Repository . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.3 Query Answering Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.4 Query Optimization Component . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.4.1 Optimization Problem Definition . . . . . . . . . . . . . . . . . . . . . 146
5.4.2 Planning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.5 Sufficient Statistics Gathering: Example . . . . . . . . . . . . . . . . . . . . . 151
5.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
6 INDUS: A FEDERATED QUERY-CENTRIC APPROACH TO
LEARNING CLASSIFIERS FROM DISTRIBUTED HETERO-
GENEOUS AUTONOMOUS DATA SOURCES. . . . . . . . . . . . . .156
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2 From Weka to AirlDM to INDUS . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.3 Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.3.1 Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.3.2 Learning NB Classifiers from Distributed Data . . . . . . . . . . . . . . 162
6.3.3 Learning NB Classifiers from Heterogeneous Distributed Data . . . . . 163
6.4 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
7 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .169
7.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
GLOSSARY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .175
INDEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .183
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .183
LIST OF TABLES
2.1 Data set D: Decide EnjoySport based on Weather Data . . . . . . . .47
4.1 Data set D1: Weather Data collected by the company C1 . . . . . . . . 108
4.2 Data set D2: Weather Data collected by the company C2 . . . . . . . 108
4.3 Mappings from H1(is-a) and H2(is-a) (corresponding to the data sets
D1 and D2) to HU(is-a) found using the name matching strategy . . . . 118
4.4 Mappings from H1(is-a) and H2(is-a) (corresponding to the data sets
D1 and D2, respectively) to HU(is-a) found from equality constraints . 118
6.1 Learning from distributed UCI/CENSUS-INCOME data sources . . . 163
6.2 Learning from heterogeneous UCI/ADULT data sources . . . . . . . . 167
LIST OF FIGURES
1.1 Example of a scenario that calls for knowledge acquisition from
autonomous, distributed, semantically heterogeneous data sources -
discovery of protein sequence-structure-function relationships using
information from PROSITE, MEROPS, SWISSPROT repositories of
protein sequence, structure, and function data. O1 and O2 are two
user ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Learning revisited: identify sufficient statistics, gather the sufficient
statistics and generate the current algorithm output . . . . . . . . . .6
1.3 Exact learning from distributed data: distribute the statistical query
among the distributed data sets and compose their answers . . . . . .6
1.4 Learning from semantically heterogeneous distributed data: each data
source has an associated ontology and the user provides a global on-
tology and mappings from the local ontologies to the global ontology .7
1.5 INDUS: INtelligent Data Understanding System . . . . . . . . . . . . 8
2.1 Learning algorithm: (Up) Learning component (Down) Classification
component . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Naive Bayes classifier . . . . . . . . . . . . . . . . . . . . . . . . . . .29
2.3 ID3 algorithm - greedy algorithm that grows the tree top-down, by
selecting the best attribute at each step (according to the information
gain). The growth of the tree stops when all the training examples are
correctly classified . . . . . . . . . . . . . . . . . . . . . . . . . . . . .32
2.4 Linearly separable data set . . . . . . . . . . . . . . . . . . . . . . . .33
2.5 The Perceptron algorithm . . . . . . . . . . . . . . . . . . . . . . . . .34
2.6 Maximum margin classifier . . . . . . . . . . . . . . . . . . . . . . . .35
2.7 Non-linearly separable data mapped to a feature space where it be-
comes linearly separable . . . . . . . . . . . . . . . . . . . . . . . . . .37
2.8 Support Vector Machines algorithm . . . . . . . . . . . . . . . . . . .38
2.9 The Dual Perceptron algorithm . . . . . . . . . . . . . . . . . . . . . . 39
2.10 Decision boundary induced by the 1 nearest neighbor classifier . . . .41
2.11 The k Nearest Neighbors algorithm . . . . . . . . . . . . . . . . . . . . 42
2.12 Learning revisited: identify sufficient statistics, gather the sufficient
statistics and generate the current algorithm output . . . . . . . . . .43
2.13 Naive Bayes classifiers learning as information extraction and hypoth-
esis generation: the algorithm asks a joint count statistical query for
each attribute in order to construct the classifier . . . . . . . . . . . .48
2.14 Decision Tree learning as information extraction and hypothesis gener-
ation: for each node, the algorithm asks a joint count statistical query
and chooses the best attribute according to the count distribution . .49
2.15 The Perceptron algorithm as information extraction and hypothesis
generation: at each iteration i + 1, the current weight wi+1(D) is up-
dated based on the refinement sufficient statistic s(D,wi(D)) . . . . . 50
2.16 The SVM algorithm as information extraction and hypothesis genera-
tion: the algorithm asks for the support vectors and their associated
weights, and the weight w is computed based on this information . . . 51
2.17 k-NN Algorithm as information extraction and hypothesis generation:
for each example x the algorithm asks for the k nearest neighbors
and computes the classification h(x) taking a majority vote over these
neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .52
3.1 Data fragmentation: (Left) Horizontally fragmented data (Right) Ver-
tically fragmented data . . . . . . . . . . . . . . . . . . . . . . . . . .54
3.2 Multi relational database . . . . . . . . . . . . . . . . . . . . . . . . .55
3.3 Exact distributed learning: distribute the statistical query among the
distributed data sets and compose their answers.(a) Eager learning (b)
Lazy learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59
3.4 Distributed statistics gathering: (Left) Serial (Right) Parallel . . . . .59
3.5 Learning Naive Bayes classifiers from horizontally distributed data:
the algorithm asks a joint count statistical query for each attribute in
order to construct the classifier. Each query is decomposed into sub-
queries, which are sent to the distributed data sources and the answers
to sub-queries are composed and sent back to the learning algorithm . 62
3.6 Naive Bayes classifier from horizontally fragmented data . . . . . . . .63
3.7 Naive Bayes classifier from vertically fragmented data . . . . . . . . .66
3.8 Learning Decision Tree classifiers from horizontally fragmented dis-
tributed data: for each node, the algorithm asks a joint count statis-
tical query, the query is decomposed into sub-queries and sent to the
distributed data sources, and the resulting counts are added up and
sent back to the learning algorithm. One iteration is shown . . . . . .70
3.9 Decision Tree classifiers: finding the best attribute for split when data
are horizontally fragmented . . . . . . . . . . . . . . . . . . . . . . . .71
3.10 Decision Tree classifiers: finding the best attribute for split when data
are vertically fragmented . . . . . . . . . . . . . . . . . . . . . . . . .75
3.11 Learning Threshold Functions from horizontally distributed data: the
algorithm asks a statistical query, the query is decomposed into sub-
queries which are subsequently sent to the distributed data sources, and
the final result is sent back to the learning algorithm. One iteration i
is shown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79
3.12 The Perceptron algorithm when data are horizontally fragmented . . . 80
3.13 Learning SVM from horizontally distributed data: the algorithm asks
a statistical query, the query is decomposed into sub-queries which are
sent to the distributed data sources, the results are composed, and the
final result is sent back to the learning algorithm . . . . . . . . . . . .85
3.14 Naive SVM from horizontally fragmented distributed data . . . . . . .85
3.15 Counterexample to naive SVM from distributed data. . . . . . . . .86
3.16 Convex hull based SVM learning from horizontally fragmented dis-
tributed data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.17 Exact and efficient LSVM learning from horizontally fragmented dis-
tributed data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .91
3.18 Learning k-NN classifiers from horizontally fragmented distributed data:
the algorithm asks a statistical query, the query is decomposed into
sub-queries which are sent to the distributed data sources, results are
composed, and the final result is sent to the learning algorithm . . . .93
3.19 Algorithm for learning k Nearest Neighbors classifiers from horizontally
fragmented distributed data . . . . . . . . . . . . . . . . . . . . . . .94
3.20 Algorithm for k Nearest Neighbors classifiers from vertically fragmented
distributed data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .97
4.1 Learning from semantically heterogeneous distributed data: each data
source has an associated ontology and the user provides a user ontology
and mappings from the data source ontologies to the user ontology . . 106
4.2 The ontology (part-of and is-a hierarchies) associated with the data
set D1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.3 The ontology (part-of and is-a hierarchies) associated with the data
set D2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.4 User ontology OU, which represents an integration of the hierarchies
corresponding to the data sources D1 and D2 in the weather domain . . 113
4.5 Algorithm for finding mappings between a set of data source hierarchies
and a user hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.6 Algorithm for checking the consistency of a set of partial injective
mappings with a set of interoperation constraints and with the
order preservation property . . . . . . . . . . . . . . . . . . . . . . . . 117
4.7 The AVTs corresponding to the Prec attribute in the ontologies O1,
O2 and OU, associated with the data sources D1 and D2 and a user,
respectively (after the names have been matched) . . . . . . . . . . . . 129
5.1 The architecture of a system for gathering sufficient statistics from
distributed heterogeneous autonomous data sources . . . . . . . . . . 140
5.2 Central resource repository: data sources, learning algorithms, itera-
tors and users registration . . . . . . . . . . . . . . . . . . . . . . . . 141
5.3 Simple user workflow examples . . . . . . . . . . . . . . . . . . . . . . 143
5.4 Internal translation of the workflows in Figure 5.3 according to the
semantics imposed by the user ontology . . . . . . . . . . . . . . . . . 143
5.5 Example of RDF file for a data source (Prosite) described by name,
URI, schema and operators allowed by the data source . . . . . . . . . 144
5.6 Query answering engine . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.7 Query optimization (planning) algorithm . . . . . . . . . . . . . . . . 149
5.8 Operator placement algorithm . . . . . . . . . . . . . . . . . . . . . . 150
5.9 (Left) User workflow Naive Bayes example (Right) User workflow
internal translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.10 The four plans found by the query optimizer for Naive Bayes example.
The operators below the dotted line are executed at the remote data
sources, and the operators above the dotted line are executed at the
central place . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.1 INDUS: Intelligent Data Understanding System. Three data sources
are shown: PROSITE, MEROPS and SWISSPROT together with
their associated ontologies. Ontologies O1 and O2 are two different
user ontologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
6.2 AirlDM: Data source independent learning algorithms through the
means of sufficient statistics and wrappers . . . . . . . . . . . . . . . . 159
6.3 Taxonomy for the attribute Ocupation in user (test) data. The filled
nodes represent the level of abstraction specified by the user. . . . . 164
6.4 Taxonomy for the attribute Ocupation in the data set Dh1. The filled
nodes represent the level of abstraction determined by the user cut.
Values Priv-house-serv, Other-service, Machine-op-inspct, Farming-
fishing are overspecified with respect to the user cut . . . . . . . . . . 165
6.5 Taxonomy for the attribute Ocupation in the data set Dh2. The filled
nodes represent the level of abstraction determined by the user cut.
The value (Sales+Tech-support) is underspecified with respect to the
user cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
ACKNOWLEDGMENTS
I express my gratitude to my advisor Dr. Vasant Honavar for guiding the research presented
in this dissertation throughout my Ph.D. student years. He has been a constant source of
motivation and encouragement. He helped me to develop my own views and opinions about
the research I undertook. I thank him for being always accessible and for providing invaluable
feedback on my work. I also thank him for providing funding for my research and for helping
me to receive an IBM Fellowship two years in a row. I am grateful to him for organizing the
AI seminar, which brought up thoughtful discussions and helped me broaden my views about Artificial
Intelligence. Most importantly, I thank him for his friendship and for his encouragement when
I felt confused or overwhelmed.
I give my warm thanks to Dr. Dianne Cook for introducing me to the wonderful world
of visualization, for being a close collaborator and a model for me, as well as a good
friend. Thanks also go to Dr. Drena Dobbs for introducing me to the world of molecular
biology and motivating me to pursue a minor in bioinformatics, and to Dr. Leslie Miller
and Dr. David Fernandez-Baca for being the first two people that I interacted with in my
first semester at Iowa State University. They both helped me overcome my fears of graduate
school. I thank everybody on my committee for fruitful interactions and for their support.
I am grateful to Adrian Silvescu for motivating me to go to graduate school, for collabo-
rating with me on several projects, for his great ideas and enthusiasm, and for being one of
my best friends ever.
Thanks to all the students who were present in the Artificial Intelligence lab at Iowa State
University while I was there. They were great colleagues and provided me with a friendly
environment to work in. Thanks to Jyotishman Pathak for closely collaborating with me on
ontology-extended data sources and ontology-extended workflow components during the last
year of my Ph.D. studies. Thanks to Facundo Bromberg for interesting discussions about
multi-agent systems. Thanks to Dae-Ki Kang for letting me use his tools for generating tax-
onomies and for teaching me about AT&T graphviz. Thanks to Jun Zhang for the useful
discussions about partially specified data. Thanks to Jie Bao for discussions and insights
about ontology management. Thanks to Changui Yan and Carson Andorf for useful dis-
cussions about biological data sources. Thanks to Oksana Yakhnenko for helping with the
implementation of AirlDM. Finally, thanks to Jaime Reinoso-Castillo for the first INDUS
prototype.
It has been an honor to be part of the Computer Science Department at Iowa State
University. There are many individuals in the Computer Science Department whom I would like
to thank for their direct or indirect contribution to my research or education. Special thanks to
Dr. Jack Lutz for introducing me to the fascinating field of Kolmogorov complexity. I thoroughly
enjoyed the two courses and the seminars I took with Dr. Lutz.
I also thank the current as well as previous staff members in the Computer Science Department,
especially Linda Dutton, Pei-Lin Shi and Melanie Eckhart. Their kind and generous assistance
with various tasks was of great importance during my years at Iowa State University.
I am grateful to Dr. John Mayfield for financial support from the Graduate College and
to Sam Ellis from IBM Rochester for his assistance in obtaining the IBM graduate fellowship.
The research described in this thesis was supported in part by grants from the National
Science Foundation (NSF 0219699) and the National Institutes of Health (NIH GM066387) to
Vasant Honavar.
Above all, I am fortunate to have family and friends that have provided so much sup-
port, encouragement, and love during my Ph.D. years and otherwise. Thanks to my parents,
Alexandra and Paul Caragea, to my sister Cornelia Caragea and to my cousin Petruta Caragea.
Thanks to my friends Pia Sindile, Nicoleta Roman, Simona Verga, Veronica Nicolae, Calin
Anton, Liviu Badea, Mircea Neagu, Marius Vilcu, Cristina and Marcel Popescu, Petrica and
Mirela Vlad, Anna Atramentova, Laura Hamilton, Carol Hand, Barbara Gwiasda, Emiko Fu-
rukawa, Shireen Choobineh, JoAnn Kovar, Mike Collyer, and especially to Charles Archer for
helping me believe that I was able to complete this thesis.
ABSTRACT
Recent advances in computing, communications, and digital storage technologies, together
with the development of high-throughput data acquisition technologies, have made it possible to
gather and store large volumes of data in digital form. These developments have resulted in
unprecedented opportunities for large-scale data-driven knowledge acquisition with the poten-
tial for fundamental gains in scientific understanding (e.g., characterization of macromolecular
structure-function relationships in biology) in many data-rich domains. In such applications,
the data sources of interest are typically physically distributed, semantically heterogeneous
and autonomously owned and operated, which makes it impossible to use traditional machine
learning algorithms for knowledge acquisition.
However, we observe that most of the learning algorithms use only certain statistics com-
puted from data in the process of generating the hypothesis that they output and we use this
observation to design a general strategy for transforming traditional algorithms for learning
from data into algorithms for learning from distributed data. The resulting algorithms are
provably exact in that the classifiers produced by them are identical to those obtained by the
corresponding algorithms in the centralized setting (i.e., when all of the data is available in
a central location) and they compare favorably to their centralized counterparts in terms of
time and communication complexity.
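This strategy can be illustrated with a minimal sketch (the data, site names, and attribute values below are invented for illustration; this is not the dissertation's actual implementation). For Naive Bayes, the sufficient statistics are joint counts of (attribute value, class label) pairs, and counts computed at each site simply add up to the counts that would be obtained from the pooled data:

```python
from collections import Counter

# Hypothetical horizontally fragmented data: each site holds rows of
# (attribute_value, class_label) pairs.
site_a = [("sunny", "yes"), ("rainy", "no"), ("sunny", "yes")]
site_b = [("rainy", "no"), ("sunny", "no")]

def joint_counts(rows):
    """Sufficient statistic for Naive Bayes: counts of (value, class) pairs."""
    return Counter(rows)

# Each site answers the statistical query locally...
local_answers = [joint_counts(site_a), joint_counts(site_b)]

# ...and the answers are composed by addition at the central site.
composed = Counter()
for answer in local_answers:
    composed += answer

# Exactness: the composed statistic equals the statistic computed from
# all of the data gathered in one place.
centralized = joint_counts(site_a + site_b)
assert composed == centralized
```

Because only the counts (not the raw rows) cross site boundaries, the communication cost scales with the number of distinct (value, class) pairs rather than the number of examples.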
To deal with the semantic heterogeneity problem, we introduce ontology-extended data
sources and define a user perspective consisting of an ontology and a set of interoperation
constraints between data source ontologies and the user ontology. We show how these con-
straints can be used to define mappings and conversion functions needed to answer statistical
queries from semantically heterogeneous data viewed from a certain user perspective. This
is further used to extend our approach for learning from distributed data into a theoretically
sound approach to learning from semantically heterogeneous data.
The work described above contributed to the design and implementation of AirlDM, a col-
lection of data source independent machine learning algorithms through the means of sufficient
statistics and data source wrappers, and to the design of INDUS, a federated, query-centric
system for knowledge acquisition from distributed, semantically heterogeneous, autonomous
data sources.
1 INTRODUCTION
1.1 Motivation
Recent advances in computing, communications, and digital storage technologies, together
with the development of high-throughput data acquisition technologies, have made it possible
to gather and store large volumes of data in digital form. For example, advances in high
throughput sequencing and other data acquisition technologies have resulted in gigabytes of
DNA, protein sequence data, and gene expression data being gathered at steadily increasing
rates in biological sciences; organizations have begun to capture and store a variety of data
about various aspects of their operations (e.g., products, customers, and transactions); com-
plex distributed systems (e.g., computer systems, communication networks, power systems)
are equipped with sensors and measurement devices that gather and store a variety of data
for use in monitoring, controlling, and improving the operation of such systems.
These developments have resulted in unprecedented opportunities for large-scale data-
driven knowledge acquisition with the potential for fundamental gains in scientific understand-
ing (e.g., characterization of macro-molecular structure-function relationships in biology) in
many data-rich domains. To exploit these opportunities scientists at different institutions
need to collaborate and share information and findings in a field or across various research
fields [Hendler, 2003]. Thus, researchers working at one level of a problem may benefit from
data or results developed for a different level of that problem or even for a different problem.
However, more often than not, it is not easy for a scientist to use the information
obtained from a different scientific community. Furthermore, even scientists working on the
same problem at different institutions find it difficult to combine their results. These difficul-
ties arise because of the large volume of information that would need to be moved around or
because of the constraints imposed by the autonomy of the data collected by a particular in-
stitution (e.g., privacy constraints). Even in cases when data can be shared, the heterogeneity
of the data collected by different scientific communities or organizations brings several diffi-
culties. This heterogeneity could be in terms of structure (relational databases, flat files, etc.)
or content (different ontological commitments, which means different assumptions concerning
the objects that exist in the world, the properties or attributes of the objects, the possible
values of attributes, and their intended meaning) [Levy, 2000]. Thus, current technology is
not sufficient for the needs of collaborative and interdisciplinary “e-Science” [e-Science, 2001],
but fortunately, new technologies are emerging with the potential to revolutionize the ability
of scientists to do collaborative work [Hendler, 2003].
Among these, a new generation of Web technology, the Semantic Web [Berners-Lee et al.,
2001], aims to support seamless and flexible access and use of semantically heterogeneous,
networked data, knowledge, and services. Thus, the Semantic Web is supposed to improve
communication between people using differing terminologies, to extend the interoperability
of databases, and to provide new mechanisms for the support of agent-based computing in
which people and machines work more interactively, making possible a new level of interaction
among scientific communities [Hendler, 2003].
Examples of scientific domains that have started to use the Semantic Web include biological
[AMIAS, 2002], environmental [SWS, 2002] and astronomical [Szalay, 2001] domains, which are
trying to link together various heterogeneous resources. Even mathematical sciences [MONET,
2004] are exploring the use of the Semantic Web for making mathematical algorithms Web-
accessible from a variety of software packages.
The e-Science initiative in the UK [e-Science, 2001] brings together research scientists and
information technologists in an effort to make possible the Semantic Web vision in science, and
recently resulted in an initiative to unite the Semantic Web and Grid computing [Euroweb,
2002] as a step towards achieving the goals of the collaborative e-Science.
It is worth noting that the Semantic Web vision cannot be achieved without exploiting
artificial-intelligence technologies in addition to the Semantic Web [Berners-Lee et al., 2001].
Hence, there has been significant interest in Semantic Web “agents” that can answer queries
based on information from Web pages and heterogeneous databases and pass them to programs
for analysis [Hendler, 2003].
Against this background, this dissertation explores the problem of automated or semi-
automated data-driven knowledge acquisition (discovery of features, correlations, and other
complex relationships and hypotheses that describe potentially interesting regularities from
large data sets) from distributed semantically heterogeneous autonomous data sources (see
Figure 1.1).
Figure 1.1 Example of a scenario that calls for knowledge acquisition from
autonomous, distributed, semantically heterogeneous data sources -
discovery of protein sequence-structure-function relationships
using information from PROSITE, MEROPS, SWISSPROT
repositories of protein sequence, structure, and function data.
O1 and O2 are two user ontologies
The major contributions of this dissertation include:
• A general strategy for design of algorithms for learning classifiers from dis-
tributed data [Caragea et al., 2004d]
• A general framework for design of algorithms for learning classifiers from
semantically heterogeneous data [Caragea et al., 2004b]
• Design of a query answering engine [Caragea et al., 2004a]
• An open source package containing data source independent machine learn-
ing algorithms [Silvescu et al., 2004b]
1.2 Traditional Machine Learning Limitations
Machine learning algorithms [Mitchell, 1997; Duda et al., 2000] offer some of the most
cost-effective approaches to automated or semi-automated knowledge acquisition in scientific
domains. However, the applicability of current approaches to machine learning to emerging
data-rich applications is, in practice, severely limited by a number of factors:
• Distributed Data Sources: As mentioned above, data repositories are large in size,
dynamic, and physically distributed. Consequently, it is neither desirable nor feasible
to gather all of the data in a centralized location for analysis. Hence, there is a need for
knowledge acquisition systems that can perform the necessary analysis of data at the
locations where the data and the computational resources are available and transmit the
results of analysis (knowledge acquired from the data) to the locations where they are
needed [Honavar et al., 1998]. In other domains, the ability of autonomous organizations
to share raw data may be limited due to a variety of reasons (e.g., privacy considerations)
[Agrawal and Srikant, 2000]. In such cases, there is a need for knowledge acquisition
algorithms that can learn from statistical summaries of data (e.g., counts of instances
that match certain criteria) that are made available as needed from the distributed data
sources in the absence of access to raw data.
• Heterogeneous Data Sources: According to the Semantic Web [Berners-Lee et al.,
2001], the ontological commitments associated with a data source are determined by the
intended use of the data repository (at design time). Furthermore, data sources that are
created for use in one context often find use in other contexts or applications. Semantic
differences among autonomously designed, owned, and operated data repositories are
simply unavoidable. Effective use of multiple sources of data in a given context requires
reconciliation of such semantic differences from the user’s point of view. Because users
often need to analyze data in different contexts from different perspectives, there is no
single privileged ontology that can serve all users, or for that matter, even a single user,
in every context. Hence, there is a need for methods that can dynamically and efficiently
extract and integrate information needed for learning (e.g., statistics) from distributed,
semantically heterogeneous data based on user-specified ontologies and mappings be-
tween ontologies.
• Autonomous Data Sources: Data sources of interest are autonomously owned and
operated. Consequently, they differ in their structure and organization (relational
databases, flat files, etc.), in the operations that can be performed on them (e.g., the
types of queries supported: relational queries, restricted subsets of relational queries,
statistical queries, keyword matches; execution of user-supplied code to compute answers
to queries that are not directly supported by the data source; storing the results of
computation at the data source for later use), and in the precise modes of interaction
they allow, which can be quite diverse. Hence, there is a need for theoretically
well-founded strategies for efficiently obtaining the information needed for learning
within the operational constraints imposed by the data sources.
1.3 Our Approach
Our approach to the problem described above comes from revisiting the traditional formu-
lation of the problem of learning from data and observing that most of the learning algorithms
use only certain statistics computed from the data in the process of generating the hypotheses
that they output [Kearns, 1998]. This yields a natural decomposition of a learning algorithm
into two components: an information extraction component that formulates and sends a sta-
tistical query to a data source and a hypothesis generation component that uses the resulting
statistic to modify a partially constructed hypothesis (and further invokes the information
extraction component as needed) (see Figure 1.2).
In the light of this observation, an algorithm for learning from distributed data can
also be decomposed into two components: (1) information extraction from distributed data and
(2) hypothesis generation.
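As an illustration of this decomposition, the following sketch (with invented names; it is not part of any actual system described here) separates a trivial learner into an information extraction component that answers count queries against the data and a hypothesis generation component that formulates those queries and outputs a hypothesis:

```python
class DataSource:
    """Wraps a local data set and answers statistical (count) queries."""
    def __init__(self, instances):
        self.instances = instances  # list of (attributes, label) pairs

    def count(self, predicate):
        # information extraction: answer a statistical query against the data
        return sum(1 for inst in self.instances if predicate(inst))

def learn_majority_class(source, labels):
    """Hypothesis generation: formulates the statistical queries it needs
    (per-label counts) and outputs a hypothesis (here, the majority label)."""
    counts = {lbl: source.count(lambda inst, l=lbl: inst[1] == l) for lbl in labels}
    return max(counts, key=counts.get)

source = DataSource([("x1", "+"), ("x2", "+"), ("x3", "-")])
print(learn_majority_class(source, ["+", "-"]))  # prints "+"
```

Note that the learner never touches the raw instances directly; it interacts with the data source only through the `count` interface, which is exactly what makes the later distributed and heterogeneous variants possible.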
Figure 1.2 Learning revisited: identify the sufficient statistics, gather the sufficient statistics, and generate the current algorithm output.
Information extraction from distributed data entails decomposing each statistical
query q posed by the information extraction component of the learner into sub-queries q1, ..., qK
that can be answered by the individual data sources D1, ..., DK, respectively, together with a
procedure for combining the answers to the sub-queries into an answer to the original query q
(see Figure 1.3). This yields a general strategy for transforming algorithms for learning from
centralized data into exact algorithms for learning from distributed data (an algorithm Ld for
learning from distributed data sets D1, ..., DK is exact relative to its centralized counterpart
L if the hypothesis produced by Ld is identical to that obtained by L from the complete data
set D obtained by appropriately combining the data sets D1, ..., DK).
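For horizontally fragmented data with disjoint fragments, count queries illustrate this strategy particularly simply: each sub-query is a copy of the original query, and answer composition is summation. A hypothetical sketch (the disjointness assumption is essential here; overlapping fragments would require correcting for double counting):

```python
def count_matching(data, predicate):
    # the statistical query q: number of instances satisfying the predicate
    return sum(1 for x in data if predicate(x))

def distributed_count(sites, predicate):
    # query decomposition: pose the same sub-query q1..qK at each site;
    # answer composition: for disjoint horizontal fragments, simply add
    return sum(count_matching(site, predicate) for site in sites)

D1, D2, D3 = [1, 4, 7], [2, 8], [5, 9]   # disjoint horizontal fragments
predicate = lambda x: x > 4

# exactness: the distributed answer equals the answer over the combined data D
assert distributed_count([D1, D2, D3], predicate) == \
       count_matching(D1 + D2 + D3, predicate)
```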
Figure 1.3 Exact learning from distributed data: distribute the statistical query among the distributed data sets and compose their answers.
We consider two types of data fragmentation: horizontal fragmentation, wherein (possibly
overlapping) subsets of data tuples are stored at different sites, and vertical fragmentation,
wherein (possibly overlapping) sub-tuples of data tuples are stored at different sites. We
apply this strategy to design exact algorithms for learning Naive Bayes, Decision Tree,
Threshold Function, Support Vector Machine, and k-NN classifiers from distributed data.
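The two fragmentation types can be illustrated on a toy table (a hypothetical sketch of the non-overlapping case: horizontal fragments partition the rows, while vertical fragments partition the columns and can be re-joined on a shared tuple identifier):

```python
# Each tuple: (id, f1, f2, label)
D = [(0, "x", 1, "+"), (1, "y", 2, "-"), (2, "z", 3, "+")]

# Horizontal fragmentation: subsets of whole tuples live at different sites
H1, H2 = D[:2], D[2:]
assert sorted(H1 + H2) == sorted(D)

# Vertical fragmentation: sub-tuples (columns) live at different sites and
# can be re-joined on the shared tuple id
V1 = [(t[0], t[1]) for t in D]           # site 1 stores attribute f1
V2 = [(t[0], t[2], t[3]) for t in D]     # site 2 stores f2 and the label
joined = [(i, a, b, c) for (i, a) in V1 for (j, b, c) in V2 if i == j]
assert sorted(joined) == sorted(D)
```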
We compare the resulting algorithms with the traditional algorithms in terms of time and
communication complexity.
In order to extend our approach to learning from distributed data (which assumes a com-
mon ontology that is shared by all of the data sources) into effective algorithms for learning
classifiers from semantically heterogeneous distributed data sources, we develop techniques for
answering the statistical queries posed by the learner in terms of the learner’s ontology O from
the heterogeneous data sources (where each data source Dk has an associated ontology Ok)
(see Figure 1.4). Thus, we solve a variant of the problem of integrated access to distributed
data repositories, the data integration problem [Levy, 2000], in order to be able to use machine
learning approaches to acquire knowledge from semantically heterogeneous data.
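As a minimal sketch of the idea (all vocabularies and the mapping below are invented for illustration), a count query posed in the user's ontology O can be answered from a source expressed in its own ontology Ok by translating the source's values through the user-specified mapping M(Ok → O) before counting:

```python
# Data source Dk records values in its local vocabulary (ontology Ok)
Dk = ["cold", "chilly", "hot", "chilly"]

# User-supplied mapping M(Ok -> O) onto the user's coarser vocabulary O
M = {"cold": "Low", "chilly": "Low", "hot": "High"}

def count_in_user_ontology(data, mapping, user_value):
    """Answer a count query posed in ontology O against data expressed in Ok."""
    return sum(1 for v in data if mapping[v] == user_value)

print(count_in_user_ontology(Dk, M, "Low"))   # prints 3
```

The learner sees only answers expressed in its own ontology O; the translation through M is carried out on the data source side of the interface, so the same hypothesis generation component works unchanged.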
Figure 1.4 Learning from semantically heterogeneous distributed data: each data source has an associated ontology, and the user provides a global ontology and mappings from the local ontologies to the global ontology.
It can be seen that learning from distributed heterogeneous autonomous data sources re-
duces to the problem of developing sound and complete techniques for answering statistical
queries from semantically heterogeneous data sources under a variety of constraints and as-
sumptions motivated by application scenarios encountered in practice.
We define a statistical query language based on operators that are needed to formulate
and manipulate statistical queries, and we design a query answering engine that has access
to a resource repository where all the information available in the system is registered. The
engine uses these resources to decompose a query q into sub-queries q1, ..., qK that can be
answered by the individual data sources D1, ..., DK respectively, to find an optimal plan for
executing each of the sub-queries qi, and to combine the answers to the sub-queries into an
answer for the original query q.
This builds on recent work on INDUS (see Figure 1.5), an ontology-based, federated, query-centric
approach to information integration and learning from distributed, heterogeneous data
sources. INDUS offers the functionality necessary to flexibly integrate information from multi-
ple heterogeneous data sources and structure the results according to a user-supplied ontology.
Learning algorithms are linked to the information integration component in INDUS, and thus
users can perform learning from distributed heterogeneous data sources in a transparent way.
Figure 1.5 INDUS: INtelligent Data Understanding System.
1.4 Literature Review
The work related to the research in this dissertation belongs to one of the following three
categories: distributed learning [Liu et al., 2004], information integration [Levy, 2000], or the
combination of distributed learning and information integration, which we call learning from
semantically heterogeneous data [Caragea et al., 2004d].
1.4.1 Distributed Learning
Distributed learning (a.k.a. distributed data mining) has received considerable attention in
the literature [Liu et al., 2004] in recent years. Work in this area can be reviewed from three points
of view: distributed learning algorithms, architectures and systems for distributed learning and
applications of distributed learning to real world problems [Park and Kargupta, 2002b]. We
discuss each of them in what follows.
1.4.1.1 Distributed Learning Algorithms
Most of the distributed learning algorithms in the literature deal with homogeneous data.
Among these, most of the existing algorithms work for horizontal data distributions, with a
few exceptions that will be pointed out below.
Many of the approaches to distributed learning come from the desire to scale up algorithms
to large data sets [Provost and Kolluri, 1999; Provost, 2000]. Conceptually, there is a
big difference between approaches to distributed learning that come from scaling up algorithms,
where the data are distributed by the algorithm in order to increase overall efficiency, and
approaches that assume that the data are inherently distributed and autonomous, so that
restrictions and constraints may need to be taken into account. The work in this dissertation falls
in the second category. We say “learning from distributed data” as opposed to “distributed
learning” to point out this difference.
Parallel Data Mining
Early work on distributed data mining arose from the need to scale up learning algorithms
to large data sets [Provost and Kolluri, 1999]. Among other approaches to the problem of
learning from large data sets, high-performance parallel computing (a.k.a. parallel data mining)
distinguishes itself as very useful for distributed settings as well.
Srivastava et al. [1999] proposed methods for distributing a large centralized data set to
multiple processors to exploit parallel processing to speed up learning. Provost and Kolluri
[1999] and Grossman and Guo [2001] surveyed several methods that exploit parallel processing
for scaling up data mining algorithms to work with large data sets.
There has been a lot of research focused on parallelizing specific algorithms. For example,
in [Amado et al., 2003; Andrade et al., 2003; Jin and Agrawal, 2003] the authors showed how
the decision tree algorithm can be parallelized. In [Dhillon and Modha, 1999; Foti et al., 2000;
Samatova et al., 2002] parallel clustering algorithms were considered. In [Tveit and Engum,
2003; Poulet, 2003] the authors proposed parallel solutions to some SVM algorithm variants. A
lot of work has focused on parallelizing association rule mining [Agrawal and Shafer, 1996;
Manning and Keane, 2001; Park et al., 1995; Parthasarathy et al., 2001; Wolff et al., 2003;
Zaiane et al., 2001; Zaki, 1999].
Ensembles Approach to Distributed Learning
Several distributed learning algorithms have their roots in ensemble methods [Dietterich,
2000]. Thus, Domingos [1997] and Prodromidis et al. [2000] used ensemble-of-classifiers
approaches to learning from horizontally fragmented distributed data, which involve learning
separate classifiers from each data set and combining them, typically using a weighted voting
scheme. In general, this combination requires gathering a subset of data from each of the data
sources at a central site to determine the weights to be assigned to the individual hypotheses
(or alternatively shipping the ensemble of classifiers and associated weights to the individual
data sources where they can be executed on local data to set the weights), which is not
desirable. Other ensemble approaches were proposed in [Fern and Brodley, 2003; Hall and
Bowyer, 2003; Jouve and Nicoloyannis, 2003; Tsoumakas and Vlahavas, 2002]. Besides the
need to transmit some subset of the data to the central site, there is another potential drawback of
the ensemble-of-classifiers approach to learning from distributed data, namely that the resulting
ensemble of classifiers is typically much harder to comprehend than a single classifier. Another
important limitation of the ensemble classifier approach to learning from distributed data is
the lack of guarantees concerning generalization accuracy of the resulting hypothesis relative
to the hypothesis obtained in the centralized setting.
Cooperation-based Distributed Learning
Although scenarios calling for learning with cooperation arise frequently in real-world
situations, not many distributed learning algorithms use cooperation actively to obtain the
final result; a few notable exceptions are discussed below.
Provost and Hennessy [1996] proposed a powerful, yet practical distributed rule learning
(DRL) algorithm using cooperation. They make use of several criteria to estimate the proba-
bility that a rule is correct (and in particular to evaluate a rule), and define what it means for
a rule to be satisfactory or acceptable over a set of examples (a rule can be acceptable for a
local learner but not satisfactory for the batch learner). The algorithm tries to find acceptable
local rules that are also satisfactory as global rules. In [Leckie and Kotagiri, 2002] the authors
proposed an algorithm for learning to share distributed probabilistic beliefs. Morinaga et al.
described another approach to collaborative data mining.
As opposed to collaboration by exchanging models (e.g., rules) between learners, in [Turinsky
and Grossman, 2000] data can be moved from one site to another in order to fully
exploit the resources of the network. One practical example of a learning algorithm that uses
cooperation to exchange data is described in [Kargupta et al., 1999] (this approach works for
vertically distributed data, as will be described below).
Learning from Vertically Distributed Data
Although most of the distributed learning algorithms assume horizontal data fragmenta-
tion, there are a few notable exceptions. Bhatnagar and Srinivasan [1997] proposed algorithms
for learning decision tree classifiers from vertically fragmented distributed data. The WoRLD
system [Aronis et al., 1996] is a collaborative approach to concept learning from vertically
fragmented data. It works by computing the cardinal distribution of feature values in the in-
dividual data sets, followed by propagation of this distribution across different sites. Features
with strong correlations to the concept to be learned are identified based on the first order sta-
tistical approximation to the cardinal distribution. Being based on first order approximations,
this approach is impractical for problems where higher order statistics are needed.
Tumer and Ghosh [2000] proposed an ensemble approach to combine local classifiers. They
used an order statistics-based technique for combining high variance models generated from
heterogeneous sites.
Park and his colleagues [Park and Kargupta, 2002a] observed that inter-site patterns cannot
be captured by aggregating heterogeneous classifiers. To deal with this problem, at each
site they construct a subset of the data that the particular classifier cannot classify with
high confidence and ship these subsets of data to the central site, where a classifier is built.
This classifier is used when data at one site is classified with low confidence by the classifier at
that site. Although this approach gives better results than simply aggregating the classifiers,
it requires data shipping and its performance is sensitive to the sample size.
Kargupta and his group proposed a framework to address the problem of learning from
heterogeneous data, called Collective Data Mining (CDM) [Kargupta et al., 1999]. Given a set
of labeled data, CDM learns a function that approximates it. CDM relies on the observation
that any function can be represented in a distributed fashion using an appropriate set of basis
functions. Thus, at each data source, the learner estimates the Fourier coefficients from the
local data, and transmits them to a central site. These estimates are combined to obtain
a set of Fourier coefficients for the function to be learned (a process which may require a
subset of the data from each source to be transmitted to the central site). At present, there
are no guarantees concerning the performance of the hypothesis obtained in the distributed
setting relative to that obtained in the centralized setting. Furthermore, a given set of Fourier
coefficients can correspond to multiple hypotheses.
Based on the CDM framework, Kargupta et al. [1999] described an algorithm for learning
decision trees from vertically fragmented distributed data using a technique proposed by
Mansour [1994] for approximating a decision tree using Fourier coefficients corresponding
to attribute combinations whose size is at most logarithmic in the number of nodes in the
tree. The CDM framework has also been used to design distributed clustering algorithms based
on collective principal component analysis [Kargupta et al., 2001] and to design distributed
algorithms for Bayesian network learning (structure or parameters) [Chen et al., 2001; Chen
and Krishnamoorthy, 2002; Chen et al., 2003b; Sivakumar et al., 2003].
Relational Learning
The task of learning from relational data has received significant attention in the literature
in the last few years. One of the first approaches to relational learning was based on Inductive
Logic Programming (ILP) [Muggleton, 1992]. Inductive Logic Programming is a broad field
which evolved from the development of algorithms for the synthesis of logic programs from
examples and background knowledge to the development of algorithms for classification, regression,
clustering, and association analysis [Dzeroski and Lavrac, 2001]. Due to its flexible
and expressive way of representing background knowledge and examples, the field considers
not only single-table representations of the data but also multiple-table representations, which
makes it a good candidate for relational learning [Blockeel and Raedt, 1997]. However, the
ILP techniques are limited in their capability to work with relational databases. Attempts to
link ILP techniques with relational databases have been made in [Lindner and Morik, 1995;
Blockeel and Raedt, 1997].
Knobbe et al. [1999] outlined a general framework for multi-relational data mining which
exploits structured query language (SQL) to gather the information needed for constructing
classifiers (e.g., decision trees) from multi-relational data. Based on this framework,
multi-relational decision tree learning algorithms have been developed [Leiva et al., 2002;
Atramentov et al., 2003].
Probabilistic models, especially Bayesian Networks (BN) [Pearl, 2000], are similar to ILP
approaches, but specify a probability distribution over a fixed set of random variables. Several
approaches for combining first order logic and Bayesian Networks have been proposed in
the literature. The most representative ones are Probabilistic Logic Programs (PLP) [Ngo
and Haddawy, 1997], Relational Bayesian Networks (RBN) [Jaeger, 1997], and Probabilistic
Relational Models (PRM) [Koller, 1999; Getoor et al., 2001; Friedman et al., 1999]. In spite of
their different backgrounds, they all seem to share the commonalities represented by Bayesian
Logic Programs (BLP) as shown in [Kersting and De Raedt, 2000].
Approaches for mining structural data in the form of graphs have also been proposed [Cook
and Holder, 2000; Gonzalez et al., 2002]. In this framework, objects in the data correspond to
vertices in the graph, and relationships between objects correspond to directed or undirected
edges in the graph. A search for patterns embedded in graphs is performed. Once a pattern
(substructure) is found, it is added to the graph in order to simplify it, by replacing instances
of the substructure with the substructure itself.
Privacy Preserving Distributed Data Mining
Several approaches to distributed data mining appeared from the need to preserve the
privacy of the information that is mined [Lindell and Pinkas, 2002]. In such cases, summaries
of the data need to be used instead of the raw data. Clifton et al. [2002] proposed a set of tools
that can be used to learn from data while preserving the privacy. Du and Atallah [2001]
designed ways to do privacy-preserving collaborative scientific computations.
Some work has focused on the design of specific algorithms in the presence of privacy constraints:
Du and Zhan [2002] introduced an algorithm for building decision trees on private data;
Kantarcioglu and Clifton [2002] and Schuster et al. [2004] dealt with privacy-preserving dis-
tributed mining of association rules from horizontally partitioned data, while Vaidya and
Clifton [2002] proposed an algorithm that works when data are vertically partitioned; Kar-
gupta et al. [2003] proposed an algorithm for computing correlations in a vertically distributed
scenario while preserving privacy; Lin et al. [2003] and Merugu and Ghosh [2003] presented
algorithms for privacy preserving clustering using EM mixture modeling and generative mod-
els, respectively, from horizontally distributed data, while Vaidya and Clifton [2003] proposed
a K-Means clustering over vertically partitioned data.
1.4.1.2 Architectures and Systems for Distributed Learning
Agent-oriented software engineering [Jennings and Wooldridge, 2001; Honavar et al., 1998;
Weiß, 1998] offers an attractive approach to implementing modular and extensible distributed
computing systems. Each data site has one or more associated agents that process the local
data and communicate the results to the other agents or to a central supervising agent that
controls the behavior of the local agents. The Java Agents for Meta-Learning (JAM) system
[Stolfo and others, 1997] (a distributed agent-based data mining system that uses meta-learning
techniques), BODHI [Kargupta et al., 1999] (a hierarchical agent-based distributed system for
collective data mining), and PADMA [Kargupta et al., 1997] (a tool for document analysis that
works in a distributed environment based on cooperative agents) follow the agent-based
architecture approach.
Another approach to address scalable distributed data mining is based on clusters of high-
performance workstations connected by a network link. Papyrus [Grossman et al., 2000] is a
system for mining distributed data sources in local and wide area cluster and super-cluster
scenarios. It is designed to find optimal strategies for moving results, models, or data over the
network. The architecture in [Ashrafi et al., 2002] is similar to JAM, PADMA, and Papyrus,
except that data sources can be heterogeneous; XML is used for data translation.
Chattratichat et al. [1999] proposed the Kensington architecture, based on a distributed
component environment. Components are located on different nodes of a generic network such
as an Intranet or the Internet. Kensington is divided into a client (which provides interactive
creation of data mining tasks), an application server (responsible for task coordination and
data management), and third-level servers (which provide high-performance data mining
services). PaDDMAS [Rana
et al., 2000] is another component-based system similar to Kensington. As opposed to Kens-
ington, PaDDMAS allows easy insertion of custom-based components. Each component has
an interface and the connection of two components is allowed only if they have compatible
interfaces.
Krishnaswamy et al. [2003] noted that distributed data mining has evolved towards em-
bracing the paradigm of application service providers, which allows small organizations or indi-
viduals to access software on demand. They proposed an architecture that demonstrates how
distributed data mining can be integrated into application service providers in an e-commerce
environment. A user is billed based on estimated costs and response times. The architecture
proposed is based on integrating client-server and agent technologies. Sarawagi and Nagaralu
[2000] explored a similar idea.
The Knowledge Grid [Cannataro et al., 2001; Cannataro and Talia, 2003; Talia, 2003;
Sunderam, 2003; Du and Agrawal, 2002] is a reference software architecture for geographically
distributed parallel and distributed knowledge discovery applications. It is built on top of
a computational grid that provides dependable, consistent, and pervasive access to high-end
computational resources. The Knowledge Grid uses the basic grid services and defines a set
of additional layers to implement the services of distributed knowledge discovery on worldwide
connected computers, where each node can be a sequential or a parallel machine. The
Knowledge Grid enables the collaboration of scientists who must mine data that are stored in
different research centers, as well as analysts who must use a knowledge management system
that operates on several data warehouses located in different company establishments
[Chervenak et al., 1999].
The Discovery Net project [Curcin et al., 2002; Guo, 2003] introduced the idea that complex
applications can make use of Grid technologies only if an application-specific layer is
introduced. Thus, in the Discovery Net architecture, there exists a layer that provides support
for constructing and publishing Knowledge Discovery Services.
1.4.1.3 Distributed Learning Real World Applications
Distributed data mining algorithms can be applied to problems in various real world do-
mains, such as: network intrusion detection [Bala et al., 2002; Kumar, 2003], credit fraud
detection [Chan et al., 1999], text classification [Kuengkrai and Jaruskulchai, 2002], chain-store
databases of short transactions [Lin et al., 2002], geoscientific data [Shek et al., 1996],
financial data mining from mobile devices [Kargupta et al., 2002], sensor-network-based dis-
tributed databases [Bonnet et al., 2001], car-health diagnostics analysis [Wirth et al., 2001],
etc.
1.4.2 Information Integration
Information integration is another problem related to the work presented in this disser-
tation. Davidson et al. [2001] and Eckman [2003] surveyed alternative approaches to data
integration. Hull [1997] summarized theoretical work on data integration. Because of our fo-
cus on knowledge acquisition from autonomous, semantically heterogeneous distributed data
sources, query-centric, federated approaches to data integration are of special interest. A
federated approach lends itself much better to settings where it is desirable to postpone the
specification of the user ontology and the mappings between data source specific ontologies
and the user ontology until the user is ready to use the system. The choice of a query-centric
approach gives users the desired flexibility to query data from multiple autonomous sources in
ways that match their own context or application specific ontological commitments (whereas
in a source-centric approach, what the data from a source should mean to a user is determined
by the source).
Early work on multi-database systems [Sheth and Larson, 1990; Barsalou and Gangopad-
hyay, 1992; Bright et al., 1992] focused on relational or object-oriented database views for
integrated access to data from several relational databases. However, these efforts were not
concerned with autonomous semantically heterogeneous data sources. More recent work [Tsai
et al., 2001] used ontologies to integrate domain specific data sources. Wiederhold and Gene-
sereth [1997] proposed mediator programs to integrate heterogeneous data sources. Some
efforts at building such mediators for information integration from multiple data repositories
(including semi-structured and unstructured data) include the TSIMMIS project at Stanford
University [Garcia-Molina et al., 1997; Chang and Garcia-Molina, 1999], the SIMS project
[Arens et al., 1993] and the Ariadne project [Knoblock et al., 2001] at the University of South-
ern California, the Hermes project at the University of Maryland [Lu et al., 1995], Information
Manifold, a system developed at ATT Bell labs for querying WWW documents [Levy, 1998],
and NIMBLE – a commercial system based on research at the University of Washington
[Draper et al., 2001]. Several data integration projects have focused specifically on integration
of biological data: The SRS (Sequence Retrieval System) [Etzold et al., 2003] developed at
the European Molecular Biology Laboratory and marketed by LION Bioscience, IBM’s Dis-
coveryLink [Haas et al., 2001], the TAMBIS project in UK [Stevens et al., 2003], the Kleisli
project [Chen et al., 2003a] and its successor K2 [Tannen et al., 2003] at the University of
Pennsylvania.
These efforts addressed, and to varying degrees solved, the following problems in data
integration: the design of query languages and rules for decomposing queries into sub-queries
and composing the answers to sub-queries into answers to the initial query. In related work,
Tomasic et al. [1998] proposed an approach to scaling up access to heterogeneous data sources.
Haas et al. [1997] investigated optimization of queries across heterogeneous data sources.
Rodriguez-Martinez and Roussopoulos [2000] proposed a code shipping approach to the design
of an extensible middleware system for distributed data sources. Lambrecht et al. [1999]
proposed a planning framework for gathering information from distributed sources.
However, each of the systems summarized above has several significant limitations. SRS
provides flexible ways to navigate and aggregate information, but offers fairly limited facilities
for querying and semantically integrating information from diverse sources. DiscoveryLink
goes a step further than SRS in that it includes an explicit data model, the relational model,
which allows users to perform SQL queries over remote sources. Kleisli does not include
a model of the available data sources, but does offer a query language and a collection of
wrappers (Kleisli drivers) for accessing biological data sources. The K2 system incorporates
some of the ideas from Kleisli, but includes some features absent in Kleisli, notably data
dictionaries (for information retrieval), and a complex value model of the data which allows
data values to be constructed by arbitrarily nesting tuples, collections (sets, bags, lists) and
variants. The TAMBIS system uses a description logic formalism for representing its ontology
which facilitates subsumption reasoning. User queries in TAMBIS are formulated in terms of
the TAMBIS ontology. However, the mapping between TAMBIS ontology and data sources
is quite restrictive. It does not allow multiple sources for the same kind of data (e.g., the use
of both Swiss-Prot and PIR as sources of protein data) and it does not allow users to impose
their own ontologies on the data sources.
Few of the systems mentioned above take into account semantic relationships between
values of attributes used to describe instances (e.g., taxonomies over attribute values) in
individual data sources.
1.4.3 Learning Classifiers from Heterogeneous Data
The combination of information integration with distributed learning algorithms is still
a relatively new idea, and thus little research has focused on it so far. The
work in this dissertation exploits this combination. In what follows, we describe two previous
attempts to combine information integration and distributed learning, followed by an overview
of our approach in this context.
InfoGrid [Giannadakis et al., 2003] is a flexible Grid system, developed on top of Kensington
[Chattratichat et al., 1999], that answers the needs of the scientific community.
It provides data publishing and integration mechanisms for a large range of different scien-
tific applications in a generic way, while allowing specific queries for individual application
domains, as opposed to the common middleware systems where all users are supposed to use
the same language. InfoGrid achieves this functionality by introducing a layer of Informa-
tion Integration Services where the querying middleware supports language parameterization
allowing specific application areas to maintain their own querying model while enabling heterogeneous
information resources to be queried effectively. InfoGrid does not change the learning
algorithms; it only prepares the data that they need. Once the data required by an
algorithm are collected, they are passed to the learning algorithm.
The work in [McClean et al., 2002] brings the information integration problem a step
closer to the learning problem by providing a way for the user to pose statistical queries in
the user ontology. Each data source has a specific ontology and meta-data that describes the
ontology and the relationship with other ontologies in the system. The authors do not assume
that a global ontology exists, as most integration systems do. However, they assume that
there exist mappings between local data source ontologies and one or several global ontologies
stored in an ontology server, as well as mappings between global ontologies. Thus, mappings
from data source ontologies to the user ontology can be found using intermediary mappings
between global ontologies. These mappings are provided by a negotiation agent that computes
them dynamically and automatically by searching the meta-data in the system, making the
problem of answering queries more flexible.
In related work, McClean et al. [2003] use the mappings found by the negotiation agent
to answer aggregate queries over heterogeneous distributed databases in the presence of data
inconsistencies or imprecise data (data specified at different levels of granularity) that are
likely to appear in such distributed scenarios. Thus, after a global ontology is constructed
dynamically by analyzing the meta-data that relates the heterogeneous databases, the aggre-
gates are derived by minimization of the Kullback-Leibler information divergence using the
EM (Expectation-Maximization) algorithm. Depending on the global ontology, a user query
can be assessed as answerable, partially answerable, or unanswerable in advance of computing
the answer itself.
The focus of the proposed research is on learning classifiers from a set of heterogeneous
autonomous distributed data sources. The autonomous nature of the data sources implies
that the learner has little control over the manner in which the data are distributed among
the different sources. The heterogeneous nature of the data opens up a new direction that
links data mining and information integration.
Unlike the papers summarized above, our work [Caragea et al., 2004d] offers a general
approach to the design of algorithms for learning from distributed data that is provably exact
with respect to its centralized counterpart. Central to our approach is a clear separation
of concerns between hypothesis construction and extraction of sufficient statistics from data.
This separation makes it possible to explore the use of sophisticated techniques for query
optimization that yield optimal plans for gathering sufficient statistics from distributed data
sources under a specified set of constraints describing the query capabilities and operations
permitted by the data sources (e.g., execution of user supplied procedures). The proposed ap-
proach also lends itself to adaptation to learning from heterogeneous distributed data sources
when the ontologies associated with the individual data sources are different from each other
[Caragea et al., 2004b]. Thus, provided well-defined mappings between ontologies can be
specified, the proposed approach to learning from distributed data can be extended to yield
an approach to learning from heterogeneous distributed data of the sort encountered in many
large scale scientific applications.
In terms of information integration, our approach proposes a clear separation between
ontologies used for data integration (which are supplied by users) and the procedures that use
ontologies to perform data integration. This allows users to replace ontologies used for data
integration on the fly, making it attractive for data integration tasks that arise in exploratory
data analysis wherein scientists might want to experiment with alternative ontologies.
1.5 Outline
The rest of the dissertation is organized as follows:
• Chapter 2: A brief introduction to machine learning systems is given together with ways
to evaluate such systems. Several classical machine learning algorithms are presented.
A careful look at these algorithms leads to the observation that only certain statistics
about data are used in the process of generating the algorithm output, which in turn
leads to a reformulation of a learning algorithm in terms of information extraction and
hypothesis generation. Sufficient statistics for the learning algorithms presented are
identified.
• Chapter 3: The problem of learning from distributed data is formally defined and a
general strategy, based on the decomposition of the algorithm into information extrac-
tion from distributed data and hypothesis generation, is proposed. We show how this
strategy can be applied to transform the algorithms introduced in Chapter 2 into ef-
ficient algorithms for learning from distributed data. We also introduce a statistical
query language for formulating and manipulating statistical queries involved in learning
algorithms.
• Chapter 4: The approach used for learning from distributed data sources is extended
to yield an approach to learning from semantically heterogeneous data sources. We
formally define ontologies and show how we can extend data sources and statistical
query operators with ontologies in order to get sound and complete answers to statistical
queries. The problem of answering queries from partially specified data is also addressed
and a solution is proposed.
• Chapter 5: A system for answering queries from distributed heterogeneous autonomous
data sources is designed. At the core of this system there is a query answering en-
gine, which receives queries, decomposes them into sub-queries corresponding to the
distributed data sources, finds optimal plans for execution, executes the plans and com-
poses the individual answers it gets back from the distributed data sources into a final
answer to the initial query.
• Chapter 6: We give an overview of INDUS, a federated, query centric approach to
learning classifiers from distributed data sources, and present AirlDM, a collection of
machine learning algorithms, which are data source independent by means of sufficient
statistics and data source wrappers. We show how AirlDM can be combined with INDUS
to obtain implementations for algorithms for learning from distributed heterogeneous
autonomous data sources. A case study is presented in the end of the chapter.
• Chapter 7: We conclude with a summary, a list of contributions that this dissertation
makes and several directions for future work.
2 LEARNING CLASSIFIERS FROM DATA
In this chapter we define the problem of learning from data and describe five learning
algorithms (Naive Bayes, Decision Tree Algorithm, Perceptron Algorithm, Support Vector
Machines and k-Nearest Neighbors algorithm). We show that any learning algorithm can be
decomposed into two components: an information extraction component in which sufficient
statistics for learning are collected and a hypothesis generation component in which sufficient
statistics are used to construct a hypothesis. For each of the algorithms described, we will
identify the sufficient statistics for learning.
2.1 Machine Learning Systems
Machine Learning is a multidisciplinary field that brings together scientists from artificial
intelligence, probability and statistics, computational complexity, information theory, etc. A
key objective of Machine Learning is to design and analyze algorithms that are able to improve
the performance at some task through experience [Mitchell, 1997].
Definition 2.1. A machine learning system is specified by several components:
• Learner: An algorithm or a computer program that is able to use experience to
improve its performance. Usually the learner has finite resources (e.g., time and
memory), so it should be able to use them efficiently.
• Task: A description of the task that the learner is trying to accomplish (e.g., learn a
concept, a function, a language, etc.).
• Experience source: Specification of the information that the learner uses to perform the
learning. The experience can take various forms such as:
– Examples: The learner is presented with labeled examples about a particular task.
Sometimes we refer to examples as instances.
– Queries: The learner can pose queries about a task to a knowledgeable teacher.
– Experiments: The learner is allowed to experiment with the task and learn from the
effects of its actions on the task.
• Background knowledge: The information that the learner has about the task before the
learning process (e.g., "simple" answers are preferable to "complex" answers). This
information may simplify the learning process.
• Performance Criteria: Measures of the quality of the learning output in terms of accuracy,
simplicity, efficiency, etc.
Definition 2.2. Let X be a sample space from where the examples are drawn and let D be
the set of all possible subsets of the sample space X. In general, we assume that the examples
are randomly chosen from an unknown distribution. A collection of examples D ∈ D is called
a data set or a data source.
Definition 2.3. Let C be the space of all possible models that we may want to learn or
approximate, and H the space of the models that a learner can draw on in order to construct
approximations of the models in C. Thus, a learning algorithm outputs an element h ∈ H,
called the hypothesis about the data.
Definition 2.4. A classification task is a task for which the learner is given experience in
the form of labeled examples, and it is supposed to learn to classify new unlabeled examples.
Thus, in a classification task, the data D typically consists of a set of training examples.
Each training example x is described by a set of attribute values <a1, ..., an>. The
class label of an example can take any value from a finite set C = {c1, ..., cm}. Hence,
D = {(x1, y1), ..., (xt, yt)}, where yi ∈ C for all i ∈ {1, ..., t}. In a classification task, the
learned hypothesis h ∈ H is called a classifier (e.g., a decision tree, a support vector machine,
etc., or even the data themselves in the case of lazy learning).
Note: In this dissertation, we will concentrate on classification tasks.
Definition 2.5. For a classification task, we say that a hypothesis h is consistent with a set of
training examples if it correctly classifies all the examples in the set. The classification error
(a.k.a. sample error or empirical error) of a hypothesis with respect to a set of examples is the
fraction of examples in the set that are misclassified by h. The true error of a hypothesis h is
the probability that the hypothesis h will misclassify an example randomly chosen according
to the underlying distribution.
As the underlying distribution is unknown, we cannot measure the true error of a hypoth-
esis, but we can measure the classification error on a data set. If this is a good estimate of
the true error, we can get a good estimate for the probability of misclassifying new unlabeled
examples.
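As a concrete illustration of Definition 2.5, the sample error can be computed directly; the hypothesis and data set below are hypothetical values chosen only for illustration:

```python
def classification_error(h, D):
    """Sample (empirical) error: the fraction of examples in D misclassified by h."""
    return sum(1 for x, y in D if h(x) != y) / len(D)

# A hypothetical classifier that predicts 1 when the single feature exceeds 0.5.
h = lambda x: 1 if x > 0.5 else 0
D = [(0.2, 0), (0.9, 1), (0.7, 0), (0.4, 0)]
print(classification_error(h, D))  # 0.25 (only the example (0.7, 0) is misclassified)
```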
Definition 2.6. We say that a learner L is consistent if it outputs a hypothesis which is
consistent with the set of training examples.
Definition 2.7. If H is a hypothesis space that a learner L is called upon to learn and D
is a training set for the learner L, then the most probable hypothesis h ∈ H given the data
D is called a maximum a posteriori (MAP) hypothesis. According to Bayesian theory,
hMAP = argmax_{h ∈ H} P(D|h)P(h), where P(h) is the prior probability of h and P(D|h) (called
the likelihood) is the probability of observing the data D given the hypothesis h. If we assume
that all the hypotheses h ∈ H are equally likely a priori, then any hypothesis that maximizes
P(D|h) is called a maximum likelihood (ML) hypothesis.
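The MAP and ML computations in Definition 2.7 can be made concrete with a small numerical sketch (the priors and likelihoods below are hypothetical values, chosen only to make the arithmetic visible):

```python
# Hypothetical priors P(h) and likelihoods P(D|h) for three hypotheses.
priors = {"h1": 0.5, "h2": 0.3, "h3": 0.2}
likelihoods = {"h1": 0.1, "h2": 0.4, "h3": 0.3}  # P(D|h)

# hMAP = argmax_h P(D|h) P(h)
h_map = max(priors, key=lambda h: likelihoods[h] * priors[h])

# With equal priors, the MAP hypothesis reduces to the ML hypothesis,
# hML = argmax_h P(D|h).
h_ml = max(likelihoods, key=likelihoods.get)

print(h_map)  # h2 (0.4 * 0.3 = 0.12 beats 0.1 * 0.5 = 0.05 and 0.3 * 0.2 = 0.06)
print(h_ml)   # h2
```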
We are interested in finding maximum a posteriori hypotheses since they are optimal in
the sense that no other hypothesis is more likely. The following theorem gives us conditions
that ensure that a maximum a posteriori hypothesis is found.
Theorem 2.8. [Mitchell, 1997] “Every consistent learner outputs a MAP hypothesis, if we
assume a uniform prior probability distribution over H (i.e., P(hi) = P(hj) for all i,j), and if
we assume deterministic, noise-free training data (i.e., P(D|h) = 1 if D and h are consistent,
and 0 otherwise).”
The Minimum Description Length (MDL) principle [Rissanen, 1978] provides a way to
implement Occam’s razor (“Prefer the simplest hypothesis that fits the data.”), thus making
it possible to take the complexity of a hypothesis into account when choosing the optimal
hypothesis. It achieves this by performing a trade-off between the complexity of a hypothesis
and the number of errors it makes: shorter hypotheses that make a few errors are
preferred to longer consistent hypotheses. This also helps to avoid over-fitting
the data.
In the next section, we will formally define the problem of learning from data by referring
back to the definitions introduced in this section.
2.2 Learning from Data
Definition 2.9. The problem of learning from data can be summarized as follows: Given
a data set D, a hypothesis class H, and a performance criterion P, the learning algorithm
L outputs a classifier h ∈ H that optimizes P. If D = {(x1, y1), ..., (xt, yt)}, then the
training examples xi, for i = 1, ..., t, represent inputs to the classifier h, while the labels yi, for
i = 1, ..., t, represent outputs of the classifier h. The goal of learning is to produce a classifier that
optimizes the performance criterion of minimizing some function of the classification error
(on the training data) and the complexity of the classifier (e.g., MDL). Under appropriate
assumptions, this is likely to result in a classifier h that assigns correct labels to new unlabeled
instances.
Thus, a learning algorithm for a classification task consists of two components: a learning
component, in which the hypothesis is learned from training examples, and a classification
component, in which the learned hypothesis is used to classify new test examples (see Figure 2.1).
The boundary that defines the division of labor between the learning and the classification
components depends on the particular learning algorithm used. Some learning algorithms do
most of the work in the training phase (eager learning algorithms) while others do most of the
work during the classification phase (lazy learning algorithms).
While in the case of eager learning a hypothesis is constructed during the learning phase,
[Figure: (up) a learner L takes data D and outputs a classifier h; (down) the classifier h takes an example x and outputs a class c]
Figure 2.1 Learning algorithm: (up) learning component; (down) classification component
based on the training examples, in the case of lazy learning the training examples are simply
stored and the generalization is postponed until a new instance needs to be classified. One
advantage that lazy learning algorithms have over eager learning algorithms is that the target
function is estimated locally (and thus it can be different for any new instance to be classified)
as opposed to being estimated once for all the training examples. The main disadvantage is
that the cost of classification in the case of lazy learning is higher than in the case of eager
learning, where most of the work is done once, during the learning phase.
2.3 Examples of Algorithms for Learning from Data
In this section, we will describe a few popular eager learning algorithms (Naive Bayes
Algorithm, Decision Tree Algorithm, Perceptron Algorithm, and Support Vector Machines)
and also a well-known lazy learning algorithm (k-Nearest Neighbors).
2.3.1 Naive Bayes Classifiers
Naive Bayes is a highly practical learning algorithm [Mitchell, 1997], comparable to pow-
erful algorithms such as decision trees or neural networks in terms of performance in some
domains. In the Naive Bayes framework, each example x is described by a conjunction of attribute
values, i.e., x = <a1, ..., an>. The class label of an example can take any value from a finite
set C = {c1, ..., cm}. We assume that the attribute values are conditionally independent
given the class label. A training set of labeled examples,
D = {<x1, y1>, ..., <xt, yt>}, is presented to the algorithm. During the learning phase,
a hypothesis h is learned from the training set. During the evaluation phase, the learner is
asked to predict the classification of new instances x.
If the new instance that needs to be classified is x = <a1, ..., an>, then according to
Bayesian decision theory, the most probable class is given by

cMAP = argmax_{cj ∈ C} P(cj | a1, ..., an).

Using Bayes' theorem, we have:

cMAP(x) = argmax_{cj ∈ C} P(a1, ..., an | cj) P(cj) / P(a1, ..., an) = argmax_{cj ∈ C} P(a1, ..., an | cj) P(cj).

Under the assumption that the attribute values are conditionally independent given the class
label, the probability of observing the attribute values a1, ..., an given a class cj is equal
to the product of the probabilities of the individual attribute values for that class:
P(a1, ..., an | cj) = ∏_{i=1}^{n} P(ai | cj), which gives the following naive Bayes classification for the
instance x = <a1, ..., an>:

cNB(x) = argmax_{cj ∈ C} P(cj) ∏_{i=1}^{n} P(ai | cj),
where the probabilities P(cj) and P(ai|cj) can be estimated based on their frequencies over
the training data. These estimates collectively specify the learned hypothesis h, which is used
to classify new instances x according to the formula for cNB(x). The pseudocode for the Naive
Bayes classifier is shown in Figure 2.2.
We mentioned before that the probabilities P(cj) and P(ai|cj) are computed based on their
frequencies in the training data. For example, for a class c, P(c) = tc/t, where tc is the number
of training examples in class c and t is the total number of training examples. Although
this estimate is good in general, it could be poor if tc is very small. The Bayesian approach
adopted in this case is to use a k-estimate (a.k.a. Laplace estimate) of the probability, defined
as (tc + kp)/(t + k) [Mitchell, 1997]. Here p is a prior estimate of the probability we want to compute
(e.g., p = 1/m if there are m possible classes), and k is a constant called the equivalent sample
size (it can be thought of as an augmentation of the set of t training examples by an additional
k virtual examples distributed according to p).

Naive Bayes Classifier
Learning Phase:
For each class cj and each attribute value ai, compute the probabilities P(cj)
and P(ai|cj) based on their frequencies over the training data.
Classification Phase:
Given a new instance x = <a1, ..., an> to be classified,
return cNB(x) = argmax_{cj ∈ C} P(cj) ∏_{i=1}^{n} P(ai|cj)
Figure 2.2 Naive Bayes classifier
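As an illustration of the two phases in Figure 2.2, with the k-estimate applied to the conditional probabilities, here is a minimal sketch in Python (the toy data set and the helper names `train_naive_bayes` and `classify_nb` are our own, assuming discrete attribute values):

```python
from collections import Counter, defaultdict

def train_naive_bayes(D, k=1):
    """Learning phase: estimate P(cj) and P(ai|cj) from frequencies,
    using the k-estimate (tc + k*p) / (tc + k) with p = 1/(number of values)."""
    t = len(D)
    class_counts = Counter(y for _, y in D)
    n = len(D[0][0])  # number of attributes
    values = [set(x[i] for x, _ in D) for i in range(n)]
    priors = {c: tc / t for c, tc in class_counts.items()}
    cond = defaultdict(dict)  # cond[(i, c)][a] = P(a_i = a | c)
    for c, tc in class_counts.items():
        for i in range(n):
            counts = Counter(x[i] for x, y in D if y == c)
            p = 1 / len(values[i])
            for a in values[i]:
                cond[(i, c)][a] = (counts[a] + k * p) / (tc + k)
    return priors, cond

def classify_nb(x, priors, cond):
    """Classification phase: cNB(x) = argmax_c P(c) * prod_i P(ai|c)."""
    def score(c):
        s = priors[c]
        for i, a in enumerate(x):
            s *= cond[(i, c)].get(a, 0.0)
        return s
    return max(priors, key=score)

# Toy data: each example is (attribute tuple, class label).
D = [(("sunny", "hot"), "no"), (("sunny", "cool"), "yes"),
     (("rainy", "cool"), "no"), (("sunny", "mild"), "yes")]
priors, cond = train_naive_bayes(D)
print(classify_nb(("sunny", "cool"), priors, cond))  # yes
```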
We have seen that the Naive Bayes classifier relies on the assumption that the values of
the attributes a1,··· ,an are conditionally independent given the class value c. When this
assumption is met, the output classifier is optimal. However, in general this assumption is not
valid. Bayesian Networks [Pearl, 2000] relax this restrictive assumption by making conditional
independence assumptions that apply to subsets of the variables. Thus, a Bayesian network
[Pearl, 2000] models the probability distribution of a set of variables (attributes) by specifying
a set of conditional independence assumptions and a set of conditional probabilities. Let
A1, ..., An be random variables whose possible values are given by the sets V(Ai), respectively.
We define the joint space of the set of variables A1, ..., An to be the cross product V(A1) ×
... × V(An), which means that each element in the joint space corresponds to one of the
possible assignments of values to the variables A1, ..., An. The probability distribution over
this space is called the joint probability distribution. A Bayesian Network describes the joint
probability distribution for a set of variables. As in the case of Naive Bayes, each probability
in the joint probability distribution can be estimated based on frequencies in the training
data. Therefore, the results presented for Naive Bayes in the next chapters can be applied to
Bayesian Networks as well.
2.3.2 Decision Tree Algorithm
Decision tree algorithms [Quinlan, 1986; Breiman et al., 1984] are among the most
widely used machine learning algorithms for building pattern classifiers from data. Their
popularity is due in part to their ability to:
• select from all attributes used to describe the data, a subset of attributes that are
relevant for classification;
• identify complex predictive relations among attributes; and
• produce classifiers that are easy to comprehend for humans.
The ID3 (Iterative Dichotomizer 3) algorithm proposed by Quinlan [Quinlan, 1986] and its
more recent variants represent a widely used family of decision tree learning algorithms. The
ID3 algorithm searches in a greedy fashion, for attributes that yield the maximum amount of
information for determining the class membership of instances in a training set D of labeled
instances. The result is a decision tree that correctly assigns each instance in D to its respective
class. The construction of the decision tree is accomplished by recursively partitioning D
into subsets based on values of the chosen attribute until each resulting subset has instances
that belong to exactly one of the m classes. The selection of an attribute at each stage of
construction of the decision tree maximizes the estimated expected information gained from
knowing the value of the attribute in question.
Consider a set of instances D which is partitioned based on the class values c1, ..., cm
into m disjoint subsets C1, C2, ..., Cm, such that D = ∪_{i=1}^{m} Ci and Ci ∩ Cj = ∅ for all i ≠ j. The
probability that a randomly chosen instance x ∈ D belongs to the subset Cj is denoted by
pj. The entropy of a set D measures the expected information needed to identify the class
membership of instances in D, and is defined as follows: entropy(D) = −∑_j pj · log2 pj. Given
some impurity measure, e.g., the entropy [Quinlan, 1986], the Gini index [Breiman et al., 1984], or
any other measure that can be defined based on the probabilities pj [Buja and Lee, 2001],
we can define the information gain for an attribute a, relative to a collection of instances
D, as follows: IGain(D, a) = I(D) − ∑_{v ∈ Values(a)} (|Dv|/|D|) I(Dv), where Values(a) is the set of
all possible values for attribute a, Dv is the subset of D for which attribute a has value v,
and I(D) can be entropy(D), the Gini index, or any other suitable measure. As in the case of
Naive Bayes, the probabilities pj can be estimated based on frequencies in the training data,
as follows: pj = |Cj|/|D|, where we denote by |·| the cardinality of a set. Thus, the entropy
can be estimated as follows: entropy(D) = −∑_j (|Cj|/|D|) · log2(|Cj|/|D|). The pseudocode of the
algorithm is shown in Figure 2.3.
To keep things simple, we assume that all the attributes are discrete or categorical. How-
ever, this discussion can be easily generalized to continuous attributes by using techniques for
discretizing the continuous-valued attributes (e.g., by dividing the continuous interval where
the attribute takes values into sub-intervals that correspond to discrete bins) [Fayyad and
Irani, 1992; Witten and Frank, 1999].
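The entropy and information gain definitions above can be sketched directly in code (a toy illustration assuming discrete attributes; the data set and function names are hypothetical):

```python
from collections import Counter
from math import log2

def entropy(D):
    """entropy(D) = -sum_j pj log2 pj, with pj = |Cj| / |D|."""
    t = len(D)
    return -sum((c / t) * log2(c / t) for c in Counter(y for _, y in D).values())

def info_gain(D, a):
    """IGain(D, a) = I(D) - sum_v (|Dv|/|D|) I(Dv), using entropy as I."""
    gain = entropy(D)
    for v in set(x[a] for x, _ in D):
        Dv = [(x, y) for x, y in D if x[a] == v]
        gain -= len(Dv) / len(D) * entropy(Dv)
    return gain

# Toy data: attribute 0 perfectly predicts the class, attribute 1 does not.
D = [(("a", "p"), 0), (("a", "q"), 0), (("b", "p"), 1), (("b", "q"), 1)]
print(entropy(D))        # 1.0 (two equally likely classes)
print(info_gain(D, 0))   # 1.0 (attribute 0 removes all uncertainty)
print(info_gain(D, 1))   # 0.0
```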
Often, decision tree algorithms also include a pruning phase to alleviate the problem of
over-fitting the training data [Mitchell, 1997; Esposito et al., 1997]. For the sake of simplicity
of exposition, we limit our discussion to decision tree construction without pruning. However,
it is relatively straightforward to modify the algorithm to incorporate a variety of pruning
methods.
2.3.3 Perceptron Algorithm
Let D = {(xi, yi) | i = 1, ..., t} be a set of training examples, where yi ∈ C = {0, 1}. We denote
by D+ = {(xi, yi) | yi = 1} and D− = {(xi, yi) | yi = 0} the sets of positive and negative examples,
respectively. We assume that they are linearly separable, which means that there exists a
linear discriminant function that has zero training error, as illustrated in Figure 2.4.
The learning task is to find a vector w*, called the weight vector, such that:
∀xi ∈ D+, w*·xi > 0 and ∀xi ∈ D−, w*·xi < 0. The perceptron algorithm [Rosenblatt,
1958] can be used for this purpose. Perceptrons are computing elements inspired by
biological neurons [McCulloch and Pitts, 1943; Minsky and Papert, 1969]. The pseudocode
of the algorithm is presented in Figure 2.5.
Thus, we can see the perceptron weight vector as representing a separating hyperplane in
the instance space. The perceptron outputs 1 if the instances lie on one side of the hyperplane
Decision Tree algorithm
Learning Phase
ID3(D,A) (D set of training examples, A set of attributes).
Create a Root node for the tree.
if (all the examples in D are in the same class ci)
{
return (the single node tree Root with label ci)
}
else
{
Let a ← BestAttribute(D)
for (each possible value v of a) do
{
Add a new tree branch below Root corresponding to the test a = v.
if (Dv is empty)
{
Below this branch add a new leaf node with
label equal to the most common class value in D.
}
else
{
Below this branch add the subtree ID3(Dv, A − {a}).
}
}
}
return Root.
end-learning-phase
Classification Phase
Given a new instance x, use the decision tree having root Root to classify x, as follows:
• Start at the root node of the tree, testing the attribute specified by this node
• Move down the tree branch corresponding to the value of the attribute in the given example
• Repeat the process for the subtree rooted at the new node,
until this node is a leaf which provides the classification of the instance.
Figure 2.3 ID3 algorithm - greedy algorithm that grows the tree top-down,
by selecting the best attribute at each step (according to the
information gain). The growth of the tree stops when all the
training examples are correctly classified
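The ID3 pseudocode in Figure 2.3 can be sketched roughly as follows (a simplified sketch assuming discrete attributes and entropy-based information gain; it omits the empty-Dv branch by only expanding values observed in D, and the data set is a hypothetical toy example):

```python
from collections import Counter
from math import log2

def entropy(D):
    t = len(D)
    return -sum((c / t) * log2(c / t) for c in Counter(y for _, y in D).values())

def id3(D, A):
    """Grow a decision tree top-down: D is a list of (attribute-tuple, label)
    pairs, A a set of attribute indices still available for splitting."""
    labels = [y for _, y in D]
    if len(set(labels)) == 1:          # all examples in the same class
        return labels[0]
    if not A:                          # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    # BestAttribute(D): the attribute with maximum information gain.
    def gain(a):
        g = entropy(D)
        for v in set(x[a] for x, _ in D):
            Dv = [(x, y) for x, y in D if x[a] == v]
            g -= len(Dv) / len(D) * entropy(Dv)
        return g
    a = max(A, key=gain)
    tree = {}
    for v in set(x[a] for x, _ in D):  # one branch per observed value of a
        Dv = [(x, y) for x, y in D if x[a] == v]
        tree[v] = id3(Dv, A - {a})
    return (a, tree)

def classify(tree, x):
    """Walk down the tree, testing the attribute specified at each node."""
    while isinstance(tree, tuple):
        a, branches = tree
        tree = branches[x[a]]
    return tree

D = [(("sunny", "hot"), "no"), (("rainy", "hot"), "yes"),
     (("sunny", "cool"), "no"), (("rainy", "cool"), "yes")]
tree = id3(D, {0, 1})
print(classify(tree, ("rainy", "hot")))  # yes
```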
[Figure: a set of points in the plane, with positive examples (1) separated from negative examples (o) by a separating hyperplane wx + b = 0]
Figure 2.4 Linearly separable data set
and 0 if they lie on the other side. The intuition behind the update rule is to “step” in
the direction that reduces the classification error. The value η, called learning rate, specifies
the step size. The Perceptron Convergence Theorem [Minsky and Papert, 1969] guarantees
that if the data are linearly separable the algorithm will find the separating hyperplane in
a finite number of steps for any η > 0. The update rule of the Perceptron has the same
mathematical form as the gradient descent rule, which is the basis for the Backpropagation
[Rumelhart et al., 1986] algorithm. The Backpropagation algorithm is, in turn, the basis for
many learning algorithms that search through spaces containing many types of hypotheses.
Therefore, the discussion related to the Perceptron algorithm applies to a large class of neuron-
based algorithms.
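The perceptron update rule discussed above can be sketched concretely (a minimal sketch with the threshold folded into the weight vector as a bias term; the toy data set and function names are our own, chosen only for illustration):

```python
def train_perceptron(D, eta=0.1, max_epochs=1000):
    """Find a weight vector w such that w.x > 0 for positive examples and
    w.x < 0 for negative ones, stepping in the direction that reduces the
    classification error."""
    n = len(D[0][0])
    w = [0.0] * (n + 1)                 # last component is the bias b
    for _ in range(max_epochs):
        errors = 0
        for x, y in D:                  # y in {0, 1}
            xb = list(x) + [1.0]        # append 1 for the bias term
            out = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else 0
            if out != y:                # update rule: w <- w + eta (y - out) x
                w = [wi + eta * (y - out) * xi for wi, xi in zip(w, xb)]
                errors += 1
        if errors == 0:                 # all examples correctly separated
            return w
    return w

# Toy linearly separable data: label is 1 iff x1 + x2 > 1.
D = [((0.0, 0.0), 0), ((1.0, 0.0), 0), ((0.0, 1.0), 0),
     ((1.0, 1.0), 1), ((2.0, 1.0), 1)]
w = train_perceptron(D)
predict = lambda x: 1 if sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) > 0 else 0
print([predict(x) for x, _ in D])  # matches the labels after convergence
```

Since the data are linearly separable, the Perceptron Convergence Theorem guarantees that the loop above terminates with zero training errors for any eta > 0.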
2.3.4 Support Vector Machines and Related Large Margin Classifiers
The Support Vector Machines (SVM) algorithm [Vapnik, 1998; Cortes and Vapnik, 1995;
Scholkopf, 1997; Cristianini and Shawe-Taylor, 2000] is a binary classification algorithm. If the
data are linearly separable, it outputs a separating hyperplane which maximizes the “margin”
between classes. If the data are not linearly separable, the algorithm works by (implicitly)
mapping the data to a higher dimensional space (where the data become separable) and
finding a maximum margin separating hyperplane in that space. This hyperplane in the
high dimensional space corresponds to a nonlinear surface in the original space. Because
they find a maximum margin separation, SVM classifiers are sometimes called “large margin