Maximum Margin Bayesian Network Classifiers
ABSTRACT We present a maximum margin parameter learning algorithm for Bayesian network classifiers using a conjugate gradient (CG) method for optimization. In contrast to previous approaches, we maintain the normalization constraints on the parameters of the Bayesian network during optimization, i.e., the probabilistic interpretation of the model is not lost. This enables us to handle missing features in discriminatively optimized Bayesian networks. In experiments, we compare the classification performance of maximum margin parameter learning to conditional likelihood and maximum likelihood learning approaches. Discriminative parameter learning significantly outperforms generative maximum likelihood estimation for naive Bayes and tree augmented naive Bayes structures on all considered data sets. Furthermore, maximizing the margin dominates the conditional likelihood approach in terms of classification performance in most cases. We provide results for a recently proposed maximum margin optimization approach based on convex relaxation [1]. While the classification results are highly similar, our CG-based optimization is computationally up to orders of magnitude faster. Margin-optimized Bayesian network classifiers achieve classification performance comparable to support vector machines (SVMs) using fewer parameters. Moreover, we show that unanticipated missing feature values during classification can be easily processed by discriminatively optimized Bayesian network classifiers, a case where discriminative classifiers usually require mechanisms to complete unknown feature values in the data first.

Franz Pernkopf, Member, IEEE, Michael Wohlmayr, Student Member, IEEE,
Sebastian Tschiatschek, Student Member, IEEE
Index Terms—Bayesian network classifier, discriminative learning, discriminative classifiers, large margin training, missing features,
convex relaxation.
1 INTRODUCTION

In statistical learning theory, the PAC bound on the expected risk for unseen data depends on the empirical risk on the training data and a measure for the generalization ability of the empirical model which is directly related to the Vapnik-Chervonenkis (VC) dimension [2]. One of the most successful discriminative classifiers, namely the support vector machine (SVM), finds a decision boundary which maximizes the margin between samples of distinct classes, resulting in good generalization properties of the classifier. In contrast, conventional discriminative training methods that rely on the conditional likelihood (CL) optimize only the empirical risk, which is suboptimal. Taskar et al. [3] observed that undirected graphical models can be efficiently trained to maximize the margin. More recently, Guo et al. [1] introduced the maximization of the margin to Bayesian networks using convex optimization. Unlike in undirected graphical models, the main difficulty for Bayesian networks is maintaining the normalization constraints of the local conditional probabilities during parameter learning. In [1], these constraints are relaxed to obtain a convex optimization problem. However, conditions on the graph structure are given, ensuring that the class posterior of the relaxed problem is unchanged in case of renormalization [4], [5]. Unfortunately, classification results for this algorithm have only been demonstrated on small-scale experiments. Since then, different margin-based training algorithms have been proposed for hidden Markov models in [6], [7] and references therein.

• F. Pernkopf, M. Wohlmayr, and S. Tschiatschek are with the Department of Electrical Engineering, Laboratory of Signal Processing and Speech Communication, Graz University of Technology, Austria. Email: pernkopf@tugraz.at, michael.wohlmayr@tugraz.at, tschiatschek@tugraz.at. This work was supported by the Austrian Science Fund (Project number P22488-N23) and (Project number S10610).
Compared to [1], we maximize the margin in Bayesian network classifiers using a different approach. We keep the sum-to-one constraints which maintain the probabilistic interpretation of the network. This has the particular advantage that summing over missing variables is still possible (as we show in this paper). However, we no longer have a convex optimization problem. Convex problems are desirable in many cases as any local optimum is a global optimum. Collobert et al. [8] show that the optimization of non-convex loss functions in SVMs can lead to sparse solutions (fewer support vectors) and accelerated training performance. They conclude that the sacrosanct popularity of convex approaches should not preempt the exploration of alternative techniques, since they may offer computational advantages. Similar observations are reported in [7] and in this article.
In this paper, we introduce maximum margin (MM) parameter learning for Bayesian network classifiers using a conjugate gradient (CG) method [9]. We treat two cases of discriminative parameter learning: both optimization criteria (CL or MM) are optimized using a CG algorithm. CG-based CL learning for Bayesian networks has been introduced in [10]. Recently, we proposed to use the extended Baum-Welch (EBW) algorithm [11] for optimizing the CL of Bayesian network classifiers [12]. In the speech community, the EBW algorithm is well known for optimizing the CL of hidden Markov models [11], [13]. EBW offers an EM-like parameter update. In fact, it is shown in [14] that the EBW algorithm resembles the gradient descent algorithm for discriminatively optimizing Gaussian mixtures using a particular step size choice in the gradient descent method. In [15], we attempted to use EBW for MM parameter optimization of Bayesian network classifiers. We empirically observed similar results as for CG-based optimization; however, EBW requires a rational objective function, which can no longer be guaranteed. Similarly, we introduced maximum margin learning to Gaussian mixture models using the EBW algorithm [16].
In experiments, we compare the classification performance of generative maximum likelihood (ML) and discriminative MM and CL parameter learning approaches. We show that maximizing the margin dominates the conditional likelihood approach with respect to classification performance in most cases. Furthermore, we provide results for maximum margin optimization using convex relaxation [1]. We achieve highly similar classification rates, whereas our CG-based margin optimization is computationally dramatically less costly. All Bayesian network classifiers use either naive Bayes (NB) or generatively and discriminatively optimized¹ tree augmented naive Bayes (TAN) structures. We also provide results for SVMs showing that margin-optimized Bayesian network classifiers are serious competitors, especially in cases where small-sized and probabilistic models are required. Moreover, we show experiments demonstrating the ability to handle missing feature scenarios. We are particularly interested in situations where unanticipated missing feature values arise during classification, i.e. during testing, which can be easily handled by our discriminatively optimized Bayesian network classifiers. Discriminative models usually require mechanisms to first complete unknown feature values in the data (known as data imputation) and then apply the standard classification approach to the completed data. We provide results for two imputation techniques, namely (i) mean value imputation, i.e. the missing feature value is replaced with the mean value of the feature over the entire training data set; (ii) k-nearest neighbor (kNN) value imputation, i.e. the mean value (for discretized data the most frequent value) of the k-nearest neighbors is used as surrogate of the missing feature value. kNN feature value imputation is slow and requires the training data to be available during classification.
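The two imputation baselines named above can be illustrated with a short sketch. It is not the implementation used in the experiments: the function names `mean_impute` and `knn_impute`, the use of NaN to mark a missing value, and the Euclidean distance restricted to the observed features are assumptions made only for this example, which covers continuous features.

```python
import numpy as np

def mean_impute(x, X_train):
    """Replace NaN entries of x by the per-feature mean of the training data."""
    x = x.copy()
    missing = np.isnan(x)
    x[missing] = X_train.mean(axis=0)[missing]
    return x

def knn_impute(x, X_train, k=3):
    """Replace NaN entries of x by the mean over the k nearest training samples.

    Distances use only the observed features of x (an assumption of this sketch).
    """
    x = x.copy()
    missing = np.isnan(x)
    dist = np.linalg.norm(X_train[:, ~missing] - x[~missing], axis=1)
    nearest = np.argsort(dist)[:k]
    x[missing] = X_train[nearest][:, missing].mean(axis=0)
    return x

# Toy data: three similar samples and one outlier.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 9.0]])
x = np.array([0.1, np.nan])          # second feature is missing
x_mean = mean_impute(x, X_train)
x_knn = knn_impute(x, X_train, k=3)
```

In this toy case the mean imputation is pulled towards the outlier, while the kNN variant is not; the kNN variant, however, needs `X_train` at classification time, which is the drawback noted in the text.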
The paper is organized as follows: In Section 2, we introduce our notation and briefly review Bayesian networks, ML parameter learning as well as NB and TAN structures. In Section 3, we introduce MM parameter learning. Section 4 summarizes a generative and two discriminative structure learning algorithms used in the experiments. In Section 5, we present experimental results for phonetic classification using the TIMIT speech corpus [17], for handwritten digit recognition using the MNIST [18] and USPS data sets, and for a remote sensing application. Furthermore, experiments for missing feature situations are reported in Sections 5.1 and 5.2. In Section 5.3, we show results for margin-based Bayesian network parameter optimization using convex relaxation and provide the runtime for each of the maximum margin parameter learning algorithms. Finally, Section 6 concludes the paper.

1. By “discriminative structure learning”, we mean that the aim of optimization is to learn the structure of the network by maximizing a cost function that is suitable for reducing classification errors, such as conditional likelihood or classification rate.
2 BAYESIAN NETWORK CLASSIFIERS

A Bayesian network [19] $\mathcal{B} = \langle \mathcal{G}, \Theta \rangle$ is a directed acyclic graph $\mathcal{G} = (\mathbf{Z}, \mathbf{E})$ consisting of a set of nodes $\mathbf{Z}$ and a set of directed edges $\mathbf{E}$ connecting the nodes. This graph represents factorization properties of the distribution of a set of random variables $\mathbf{Z} = \{Z_1, \ldots, Z_{N+1}\}$, where $|Z_j|$ denotes the cardinality of $Z_j$. The variables in $\mathbf{Z}$ have values denoted by lower case letters $\mathbf{z} = \{z_1, z_2, \ldots, z_{N+1}\}$. We use boldface capital letters, e.g. $\mathbf{Z}$, to denote a set of random variables and correspondingly boldface lower case letters, e.g. $\mathbf{z}$, denote a set of instantiations (values). Without loss of generality, in Bayesian network classifiers the random variable $Z_1$ represents the class variable $C \in \{1, \ldots, |C|\}$, where $|C|$ is the number of classes and $\mathbf{X}_{1:N} = \{X_1, \ldots, X_N\} = \{Z_2, \ldots, Z_{N+1}\}$ denotes the set of random variables which model the $N$ attributes of the classifier. In a Bayesian network each node is independent of its nondescendants given its parents. Conditional independencies among variables reduce the computational effort for exact inference on such a graph. The set of parameters which quantify the network is represented by $\Theta$. Each node $Z_j$ is represented as a local conditional probability distribution given its parents $\mathbf{Z}_{\Pi_j}$. We use $\theta^j_{i|h}$ to denote a specific conditional probability table entry (assuming discrete variables); the probability that variable $Z_j$ takes on its $i$-th value assignment given that its parents $\mathbf{Z}_{\Pi_j}$ take their $h$-th (lexicographically ordered) assignment, i.e. $\theta^j_{i|h} = P_{\Theta}(Z_j = i \mid \mathbf{Z}_{\Pi_j} = h)$. Hence, $h$ contains the parent configuration assuming that the first element of $h$, denoted as $h_1$, is the conditioning class and the remaining elements $h \setminus h_1$ are the conditioning parent attribute values. The training data consists of $M$ independent and identically distributed samples $\mathcal{S} = \{\mathbf{z}^m\}_{m=1}^{M} = \{(c^m, \mathbf{x}^m_{1:N})\}_{m=1}^{M}$, where $M = |\mathcal{S}|$. The joint probability distribution of a sample $\mathbf{z}^m$ is determined as

\[
P_{\Theta}(\mathbf{Z} = \mathbf{z}^m) = \prod_{j=1}^{N+1} P_{\Theta}\left(Z_j = z^m_j \mid \mathbf{Z}_{\Pi_j} = \mathbf{z}^m_{\Pi_j}\right) = \prod_{j=1}^{N+1} \prod_{i=1}^{|Z_j|} \prod_{h} \left(\theta^j_{i|h}\right)^{u^{j,m}_{i|h}}, \tag{1}
\]

where we use $u^{j,m}_{i|h}$ to represent the $m$-th sample in binary form, i.e. $u^{j,m}_{i|h} = \mathbf{1}_{\{z^m_j = i \text{ and } \mathbf{z}^m_{\Pi_j} = h\}}$. Symbol $\mathbf{1}_{\{i=j\}}$ denotes the indicator function, i.e. it equals 1 if the Boolean expression $i = j$ is true and 0 otherwise. The class labels are predicted using the maximum a-posteriori (MAP) estimate obtained by Bayes' rule, i.e.

\[
P_{\Theta}(C = c \mid \mathbf{X}_{1:N} = \mathbf{x}^m_{1:N}) = \frac{P_{\Theta}(C = c, \mathbf{X}_{1:N} = \mathbf{x}^m_{1:N})}{\sum_{c'=1}^{|C|} P_{\Theta}(C = c', \mathbf{X}_{1:N} = \mathbf{x}^m_{1:N})},
\]

where the most likely class $c^*$ is determined as $c^* = \arg\max_{c' \in \{1, \ldots, |C|\}} P_{\Theta}(C = c' \mid \mathbf{X}_{1:N} = \mathbf{x}^m_{1:N})$.

For the sake of brevity, we only write the instantiations of the random variables in the sequel.
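As an illustration of the factorization in Eq. (1) and of MAP classification, the following sketch evaluates the joint probability and the class posterior for a small NB network. The array layout (`cpts[j][c, i]`), the use of `None` for a missing attribute, 0-based class indices, and all numbers are assumptions of this example. For NB, a missing attribute can simply be skipped, because summing its normalized conditional distribution over all values gives one.

```python
import numpy as np

def log_joint(prior, cpts, x):
    """log P(C=c, X=x) for every class c of a naive Bayes network.

    prior: shape (n_classes,), P(C=c).
    cpts:  list with one array per attribute, cpts[j][c, i] = P(X_j=i | C=c).
    x:     observed attribute values; None marks a missing value, whose factor
           sums to one and therefore drops out of the product.
    """
    lj = np.log(prior)
    for j, value in enumerate(x):
        if value is not None:
            lj = lj + np.log(cpts[j][:, value])
    return lj

def posterior(prior, cpts, x):
    """Class posterior P(C=c | X=x) by Bayes' rule."""
    lj = log_joint(prior, cpts, x)
    p = np.exp(lj - lj.max())        # subtract the maximum for numerical stability
    return p / p.sum()

# Hypothetical two-class, two-attribute network.
prior = np.array([0.6, 0.4])
cpts = [np.array([[0.9, 0.1], [0.2, 0.8]]),
        np.array([[0.7, 0.3], [0.5, 0.5]])]

post = posterior(prior, cpts, [1, 0])
c_star = int(np.argmax(post))                      # MAP class (0-based index)
post_missing = posterior(prior, cpts, [1, None])   # second attribute unobserved
```

The second call works only because the conditional probability tables are normalized, which is the property the proposed learning method preserves.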
2.1 Generative ML Parameter Learning

The log likelihood function of a fixed structure of $\mathcal{B}$ is

\[
LL(\mathcal{B}|\mathcal{S}) = \sum_{m=1}^{M} \sum_{j=1}^{N+1} \sum_{i=1}^{|Z_j|} \sum_{h} u^{j,m}_{i|h} \log\left(\theta^j_{i|h}\right).
\]

Maximizing $LL(\mathcal{B}|\mathcal{S})$ leads to the ML estimate of the parameters

\[
\theta^j_{i|h} = \frac{\sum_{m=1}^{M} u^{j,m}_{i|h}}{\sum_{m=1}^{M} \sum_{l=1}^{|Z_j|} u^{j,m}_{l|h}},
\]

using Lagrange multipliers to constrain the parameters to a valid normalized probability distribution, i.e. $\sum_{i=1}^{|Z_j|} \theta^j_{i|h} = 1$.

2.2 Discriminative CL Parameter Learning
Maximizing CL is tightly connected to minimizing the empirical risk. Unfortunately, CL does not decompose as ML does. Consequently, there is no closed-form solution. The conditional log likelihood (CLL) is

\[
CLL(\mathcal{B}|\mathcal{S}) = \log \prod_{m=1}^{M} P_{\Theta}(c^m \mid \mathbf{x}^m_{1:N}) = \sum_{m=1}^{M} \left[ \log P_{\Theta}(c^m, \mathbf{x}^m_{1:N}) - \log \sum_{c=1}^{|C|} P_{\Theta}(c, \mathbf{x}^m_{1:N}) \right]. \tag{2}
\]

A conjugate gradient algorithm [10], [20] and the EBW method [12] have been proposed for maximizing $CLL(\mathcal{B}|\mathcal{S})$. For the sake of completeness, we briefly sketch the CG algorithm for CL optimization in the Appendix.
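The relative-frequency ML estimate of Section 2.1 and the CLL of Eq. (2) can be sketched together for an NB structure. This is a minimal illustration under stated assumptions: the function names are hypothetical, classes and attribute values are 0-based integers, every class occurs at least once in the data, and the renormalization after replacing zero probabilities by a small `eps` is a choice made for this example.

```python
import numpy as np

def ml_estimate_nb(c, X, n_classes, n_values, eps=1e-6):
    """Relative-frequency (ML) estimates of the NB parameters.

    Zero probabilities are replaced by eps and the rows are renormalized
    (the renormalization is an assumption of this sketch).
    """
    prior = np.bincount(c, minlength=n_classes).astype(float)
    prior /= prior.sum()
    cpts = []
    for j in range(X.shape[1]):
        counts = np.zeros((n_classes, n_values))
        np.add.at(counts, (c, X[:, j]), 1.0)       # sum over m of u^{j,m}_{i|h}
        theta = counts / counts.sum(axis=1, keepdims=True)
        theta = np.maximum(theta, eps)
        cpts.append(theta / theta.sum(axis=1, keepdims=True))
    return prior, cpts

def cll(prior, cpts, c, X):
    """Conditional log likelihood of Eq. (2) for an NB structure."""
    lj = np.tile(np.log(prior), (len(c), 1))       # log joint, shape (M, n_classes)
    for j, theta in enumerate(cpts):
        lj += np.log(theta[:, X[:, j]]).T
    log_norm = np.log(np.exp(lj).sum(axis=1))      # log sum_c P(c, x^m)
    return float(np.sum(lj[np.arange(len(c)), c] - log_norm))

# Toy data set with M = 5 samples and N = 2 binary attributes.
c = np.array([0, 0, 0, 1, 1])
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [1, 0]])
prior, cpts = ml_estimate_nb(c, X, n_classes=2, n_values=2)
value = cll(prior, cpts, c, X)
```

The ML step is a single pass over counts, whereas `cll` couples all parameters through the normalization term, which is why CL optimization needs an iterative method.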
2.3 Structures
In this work, we restrict our experiments to NB and TAN structures defined in the next paragraphs. The NB network assumes that all the attributes are conditionally independent given the class label. This means that, given $C$, any subset of $\mathbf{X}$ is independent of any other disjoint subset of $\mathbf{X}$. As reported in the literature [21], [22], the performance of the NB classifier is surprisingly good even if the conditional independence assumption between attributes is unrealistic or even false in most of the data. Reasons for the utility of the NB classifier range from benefits from the bias/variance tradeoff perspective [21] to structures that are inherently poor from a generative perspective but good from a discriminative perspective [23]. The structure of the naive Bayes classifier represented as a Bayesian network is illustrated in Figure 1(a).
Fig. 1. Bayesian Network: (a) NB, (b) TAN. Both panels show the class node C and the attribute nodes X1, X2, X3, ..., XN.
In order to overcome some of the limitations of the NB classifier, Friedman et al. [21] introduced the TAN classifier. A TAN is based on structural augmentations of the NB network: Additional edges are added between attributes. Each attribute may have at most one other attribute as an additional parent which means that the tree-width of the attribute induced sub-graph is unity², i.e. we have to learn a 1-tree over the attributes. The maximum number of edges added to relax the independence assumption between the attributes is $N - 1$. Thus, two attributes might not be conditionally independent given the class label in a TAN. An example of a TAN network is shown in Figure 1(b).
A TAN network is typically initialized as an NB network and additional edges between attributes are determined through structure learning. Hence, TAN structures are restricted such that the class node remains parentless, i.e. $C_{\Pi} = \emptyset$. An extension of the TAN network is to use a k-tree, i.e. each attribute can have a maximum of k attribute nodes as parents. In [20], we noticed that 2-trees over the features do not improve classification performance significantly without regularization. Therefore, we limit the experiments to NB and TAN structures. Many other network topologies have been suggested in the past; a good overview is provided in [26].

2. The tree-width of a graph is defined as the size (i.e. number of variables) of the largest clique of the moralized and triangulated directed graph minus one. Since there are commonly multiple triangulated graphs, the tree-width is defined by the triangulation where the largest clique has the fewest number of variables. More details are given in [24], [25] and references therein.
3 DISCRIMINATIVE MARGIN-BASED PARAMETER LEARNING

The proposed CG-based maximum margin learning algorithm is developed in the following sections.
3.1 Maximum Margin Objective Function
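The multi-class margin [1] of sample $m$ can be expressed as

\[
\tilde{d}^m_{\Theta} = \min_{c \neq c^m} \frac{P_{\Theta}(c^m \mid \mathbf{x}^m_{1:N})}{P_{\Theta}(c \mid \mathbf{x}^m_{1:N})} = \frac{P_{\Theta}(c^m, \mathbf{x}^m_{1:N})}{\max_{c \neq c^m} P_{\Theta}(c, \mathbf{x}^m_{1:N})}. \tag{3}
\]

Sample $m$ is correctly classified if and only if $\tilde{d}^m_{\Theta} > 1$. We replace the maximum operator by the differentiable softmax function $\max_x f(x) \approx \log \left[ \sum_x \exp(\eta f(x)) \right]^{\frac{1}{\eta}}$ parameterized by $\eta$, where $\eta \geq 1$ and $f(x)$ is non-negative [6]. In the limit of $\eta \to \infty$ the approximation approaches the maximum operator.³ Using this we can define the approximate multi-class margin $d^m_{\Theta}$. Taking the logarithm we obtain

\[
\log d^m_{\Theta} = \log P_{\Theta}(c^m, \mathbf{x}^m_{1:N}) - \frac{1}{\eta} \log \sum_{c \neq c^m} \left( P_{\Theta}(c, \mathbf{x}^m_{1:N}) \right)^{\eta}. \tag{4}
\]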
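A small numerical sketch of Eq. (4) shows how the softmax term approaches the maximum over the competing classes as η grows. The function name `log_margin`, the log-domain evaluation, and the three joint probabilities are assumptions of this example.

```python
import numpy as np

def log_margin(log_joint, c_true, eta=10.0):
    """Approximate log-margin of Eq. (4) for one sample.

    log_joint: array with log P(c, x) for every class c.
    c_true:    index of the true class c^m.
    """
    others = np.delete(log_joint, c_true)
    # (1/eta) * log sum_{c != c^m} P(c, x)^eta, evaluated stably in the log domain
    a = eta * others
    soft_max = (a.max() + np.log(np.exp(a - a.max()).sum())) / eta
    return log_joint[c_true] - soft_max

# Hypothetical joint probabilities P(c, x) for three classes; class 0 is the true one.
lj = np.log(np.array([0.16, 0.042, 0.01]))
exact = lj[0] - max(lj[1], lj[2])          # logarithm of the exact margin, Eq. (3)
approx_1 = log_margin(lj, 0, eta=1.0)
approx_50 = log_margin(lj, 0, eta=50.0)
```

Because the softmax term is never smaller than the true maximum, the approximate log-margin is a lower bound on the exact one and tightens with increasing η.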
Usually, the maximum margin approach maximizes the margin of the sample with the smallest margin for a separable classification problem [27], i.e. the objective is to maximize $\min_{m=1,\ldots,M} \log d^m_{\Theta}$. For a non-separable problem, we aim to relax this by introducing a soft margin, i.e. we focus on samples with $\log d^m_{\Theta}$ close to zero. For this purpose, we consider the hinge loss function

\[
\widetilde{M}(\mathcal{B}|\mathcal{S}) = \sum_{m=1}^{M} \min\left(1, \lambda \log d^m_{\Theta}\right),
\]

where the scaling parameter $\lambda > 0$ controls the margin with respect to the loss function and is set by cross-validation. Maximizing this function with respect to the parameters $\Theta$ implicitly increases the log-margin, whereas the emphasis is on samples with $\lambda \log d^m_{\Theta} < 1$, i.e. samples with a large positive margin have no impact on the optimization. Maximizing $\widetilde{M}(\mathcal{B}|\mathcal{S})$ using CG is not straightforward due to the discontinuity of the derivative at $\lambda \log d^m_{\Theta} = 1$. Therefore, we propose to use a smooth hinge function $h_{\kappa}(y)$ inspired by the Huber loss [28] which is differentiable in $\mathbb{R}$ and has a similar shape as $\min[1, y]$:

\[
h_{\kappa}(y) =
\begin{cases}
y + \kappa, & \text{if } y \leq 1 - 2\kappa, \\
1 - \frac{(y-1)^2}{4\kappa}, & \text{if } 1 - 2\kappa < y < 1, \text{ and} \\
1, & \text{if } y \geq 1,
\end{cases} \tag{5}
\]

where $\kappa$ parameterizes this loss function. For $\kappa \to 0$ the smooth hinge function approaches $\min(1, y)$. This function requires dividing the data $\mathcal{S}$ into three partitions depending on $y^m = \lambda \log d^m_{\Theta}$, i.e. $\mathcal{S}^1_{\Theta}$ contains samples where $y^m \leq 1 - 2\kappa$, $\mathcal{S}^2_{\Theta}$ consists of samples with a margin in the range $1 - 2\kappa < y^m < 1$, and $\mathcal{S}^3_{\Theta} = \mathcal{S} \setminus \left( \mathcal{S}^1_{\Theta} \cup \mathcal{S}^2_{\Theta} \right)$. The smooth hinge function $h_{\kappa}(y)$ parameterized by $\kappa$ is shown in Figure 2.

Fig. 2. Differentiable approximation of the hinge loss for κ = 0.5 and κ = 0.1. The plot shows $h_{\kappa}(y)$ over $y$ for the hinge and for the smooth hinge with κ = 0.5 and κ = 0.1, with the linear and quadratic regions marked.

3. Empirical results showed that the performance of the algorithm is not sensitive to the choice of η for η ≥ 5. The case η = 1 resembles the classical softmax function which empirically showed a slightly inferior performance.
Similar to [29], we empirically identified typical values of $\kappa$ in the range between 0.01 and 0.5. Tuning parameter $\kappa$ in the given range has a moderate impact on the performance (as we show in experiments). Hence, we suggest fixing this parameter in case of time constraints.

Finally, using the introduced smooth hinge loss our objective function for margin maximization is

\[
M(\mathcal{B}|\mathcal{S}) = \sum_{m \in \mathcal{S}^1_{\Theta}} \left( \lambda \log d^m_{\Theta} + \kappa \right) + \sum_{m \in \mathcal{S}^2_{\Theta}} \left[ 1 - \frac{\left( \lambda \log d^m_{\Theta} - 1 \right)^2}{4\kappa} \right] + \left| \mathcal{S}^3_{\Theta} \right|. \tag{6}
\]

This function is differentiable and can be optimized by CG methods.
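The smooth hinge of Eq. (5) and the objective of Eq. (6) are compact enough to write down directly. The sketch below is only an illustration with hypothetical function names and toy log-margins; summing `smooth_hinge` over all samples is equivalent to the three partition sums in Eq. (6).

```python
import numpy as np

def smooth_hinge(y, kappa):
    """Smooth hinge h_kappa(y) of Eq. (5), applied elementwise."""
    y = np.asarray(y, dtype=float)
    linear = y + kappa
    quadratic = 1.0 - (y - 1.0) ** 2 / (4.0 * kappa)
    return np.where(y <= 1.0 - 2.0 * kappa, linear,
                    np.where(y < 1.0, quadratic, 1.0))

def mm_objective(log_d, lam, kappa):
    """Objective of Eq. (6): sum of h_kappa(lambda * log d^m) over all samples."""
    return float(np.sum(smooth_hinge(lam * np.asarray(log_d), kappa)))

# One sample from each of the partitions S^1, S^2 and S^3 (for kappa = 0.5).
log_d = np.array([-2.0, 0.5, 3.0])
value = mm_objective(log_d, lam=1.0, kappa=0.5)
```

With κ = 0.5 the three samples contribute −1.5, 0.875 and 1, so the third (large-margin) sample saturates and would not influence a gradient step.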
3.2 CG Algorithm
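We use a conjugate gradient algorithm with line-search [30] which requires both the objective function (6) and its derivative. In particular, the Polak-Ribière method is used [9]. The probability $\theta^j_{i|h}$ is constrained to $\theta^j_{i|h} \geq 0$ and $\sum_{i=1}^{|Z_j|} \theta^j_{i|h} = 1$. To incorporate these constraints in the conjugate gradient algorithm we reparameterize the problem according to

\[
\theta^j_{i|h} = \frac{\exp\left(\beta^j_{i|h}\right)}{\sum_{l=1}^{|Z_j|} \exp\left(\beta^j_{l|h}\right)},
\]

where $\beta^j_{i|h} \in \mathbb{R}$ is unconstrained. The CG algorithm requires the gradient $\frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \beta^j_{i|h}}$ which is obtained using the chain rule as

\[
\frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \beta^j_{i|h}} = \sum_{k=1}^{|Z_j|} \frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \theta^j_{k|h}} \frac{\partial \theta^j_{k|h}}{\partial \beta^j_{i|h}}. \tag{7}
\]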
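The reparameterization and the chain rule of Eq. (7) can be checked numerically on a toy problem. In the sketch below, the softmax Jacobian ∂θ_k/∂β_i = θ_k(δ_ki − θ_i) is a standard identity; the objective f(θ) = Σ_i w_i log θ_i, the weights `w`, and the function names are hypothetical and serve only to compare the analytic gradient with central finite differences.

```python
import numpy as np

def softmax(beta):
    """theta_i = exp(beta_i) / sum_l exp(beta_l) for one CPT column."""
    e = np.exp(beta - beta.max())
    return e / e.sum()

def grad_beta(grad_theta, theta):
    """Chain rule of Eq. (7) using d theta_k / d beta_i = theta_k (delta_ki - theta_i)."""
    jac = np.diag(theta) - np.outer(theta, theta)   # jac[k, i]
    return jac.T @ grad_theta

# Toy objective f(theta) = sum_i w_i log(theta_i) (hypothetical, for checking only).
w = np.array([3.0, 1.0, 2.0])
f = lambda beta: float(w @ np.log(softmax(beta)))

beta = np.array([0.2, -1.0, 0.5])
theta = softmax(beta)
analytic = grad_beta(w / theta, theta)              # df/dtheta_k = w_k / theta_k

eps = 1e-6
numeric = np.array([(f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
```

Any value of `beta` maps to a valid distribution `theta`, so an unconstrained optimizer can update `beta` freely without violating the sum-to-one constraint.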
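3.3 Derivatives

The derivative $\frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \theta^j_{i|h}}$ required in Eq. (7) is

\[
\frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \theta^j_{i|h}} = \sum_{m=1}^{M} s^{m,\lambda}_{\Theta} \frac{\partial \log d^m_{\Theta}}{\partial \theta^j_{i|h}},
\]

where $s^{m,\lambda}_{\Theta}$ denotes a sample dependent weight given by

\[
s^{m,\lambda}_{\Theta} =
\begin{cases}
\lambda, & \text{if } m \in \mathcal{S}^1_{\Theta}, \\
-\frac{\lambda}{2\kappa}\left( \lambda \log d^m_{\Theta} - 1 \right), & \text{if } m \in \mathcal{S}^2_{\Theta}, \text{ and} \\
0, & \text{if } m \in \mathcal{S}^3_{\Theta}.
\end{cases} \tag{8}
\]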
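The weight in Eq. (8) is the derivative of the smooth hinge with respect to the log-margin, which a few lines make concrete. The function name `sample_weight` and the toy values are assumptions of this sketch.

```python
import numpy as np

def sample_weight(log_d, lam, kappa):
    """Sample weight s^{m,lambda} of Eq. (8) as a function of log d^m."""
    y = lam * np.asarray(log_d, dtype=float)
    return np.where(y <= 1.0 - 2.0 * kappa, lam,
                    np.where(y < 1.0, -lam / (2.0 * kappa) * (y - 1.0), 0.0))

# One sample from each partition S^1, S^2, S^3 for lambda = 1 and kappa = 0.5.
s = sample_weight(np.array([-2.0, 0.5, 3.0]), lam=1.0, kappa=0.5)
# A different setting: lambda = 0.5, kappa = 0.25, sample in the quadratic region.
s_mid = sample_weight(np.array([1.5]), lam=0.5, kappa=0.25)
```

Samples with a small or negative margin receive the full weight λ, samples in the quadratic region are weighted down linearly, and samples with a large margin are ignored.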
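When determining the derivative of $\log d^m_{\Theta}$ we have to distinguish between two cases: For TAN and NB structures each parameter $\theta^j_{i|h}$ involves the class node value, either $C = i$ for $j = 1$ or $C = h_1$ for $j > 1$, where $h_1$ denotes the class instantiation $h_1 \in h$. Due to this fact, at most one summand is nonzero when differentiating the term $\sum_{c \neq c^m} \left( P_{\Theta}(c, \mathbf{x}^m_{1:N}) \right)^{\eta}$ in Eq. (4) with respect to $\theta^j_{i|h}$.

Case A: For the class variable, i.e. $j = 1$ and $h = \emptyset$, the derivative of Eq. (4) after introducing the joint probability of Eq. (1) results in

\[
\frac{\partial \log d^m_{\Theta}}{\partial \theta^1_i} = \frac{u^{1,m}_i}{\theta^1_i} - \mathbf{1}_{\{i \neq c^m\}} \frac{V^m_i}{\theta^1_i},
\]

where we set $V^m_i$ to

\[
V^m_i = \frac{\left[ P_{\Theta}(i, \mathbf{x}^m_{1:N}) \right]^{\eta}}{\sum_{c \neq c^m} \left[ P_{\Theta}(c, \mathbf{x}^m_{1:N}) \right]^{\eta}}.
\]

Case B: For the attribute variables, i.e. $j > 1$, we differentiate correspondingly and have

\[
\frac{\partial \log d^m_{\Theta}}{\partial \theta^j_{i|h}} = \frac{u^{j,m}_{i|h}}{\theta^j_{i|h}} - \mathbf{1}_{\{h_1 \neq c^m\}} V^m_{h_1} \frac{v^{j,m}_{i|h \setminus h_1}}{\theta^j_{i|h}},
\]

where $v^{j,m}_{i|h \setminus h_1} = \mathbf{1}_{\{z^m_j = i \text{ and } \mathbf{z}^m_{\Pi_j} = h \setminus h_1\}}$.

Hence, the gradient $\frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \Theta}$ for Case A and Case B is

\[
\frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \theta^1_i} = \sum_{m=1}^{M} \frac{s^{m,\lambda}_{\Theta}}{\theta^1_i} \left[ u^{1,m}_i - \mathbf{1}_{\{i \neq c^m\}} V^m_i \right]
\]

and

\[
\frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \theta^j_{i|h}} = \sum_{m=1}^{M} \frac{s^{m,\lambda}_{\Theta}}{\theta^j_{i|h}} \left[ u^{j,m}_{i|h} - \mathbf{1}_{\{h_1 \neq c^m\}} V^m_{h_1} v^{j,m}_{i|h \setminus h_1} \right],
\]

respectively. These derivatives are further used in Eq. (7) resulting in the required gradient for the CG algorithm. Hence, for Case A we obtain

\[
\frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \beta^1_i} = \sum_{m=1}^{M} s^{m,\lambda}_{\Theta} \left[ u^{1,m}_i - \mathbf{1}_{\{i \neq c^m\}} V^m_i \right] - \theta^1_i \sum_{m=1}^{M} \sum_{c=1}^{|C|} s^{m,\lambda}_{\Theta} \left[ u^{1,m}_c - \mathbf{1}_{\{c \neq c^m\}} V^m_c \right],
\]

and for Case B we have

\[
\frac{\partial M(\mathcal{B}|\mathcal{S})}{\partial \beta^j_{i|h}} = \sum_{m=1}^{M} s^{m,\lambda}_{\Theta} \left[ u^{j,m}_{i|h} - \mathbf{1}_{\{h_1 \neq c^m\}} V^m_{h_1} v^{j,m}_{i|h \setminus h_1} \right] - \theta^j_{i|h} \sum_{m=1}^{M} \sum_{l=1}^{|Z_j|} s^{m,\lambda}_{\Theta} \left[ u^{j,m}_{l|h} - \mathbf{1}_{\{h_1 \neq c^m\}} V^m_{h_1} v^{j,m}_{l|h \setminus h_1} \right].
\]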
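The Case A gradient with respect to the class parameters can be checked against finite differences of the objective in Eq. (6). The sketch below does this for an NB-like model in which the attribute terms are fixed random numbers; the constants `ETA`, `LAM`, `KAPPA`, the random toy data, and all function names are assumptions of this example and not the setup used in the experiments.

```python
import numpy as np

ETA, LAM, KAPPA = 5.0, 1.0, 0.5

def softmax(beta):
    e = np.exp(beta - beta.max())
    return e / e.sum()

def log_joint(theta1, log_lik):
    # log P(c, x^m) = log theta^1_c + sum_j log theta^j(x^m_j | c); shape (M, C)
    return np.log(theta1) + log_lik

def margins(lj, c):
    """log d^m of Eq. (4) and the weights V^m_i (set to zero for i = c^m)."""
    M = len(c)
    mask = np.ones_like(lj, dtype=bool)
    mask[np.arange(M), c] = False                 # True where i != c^m
    p_eta = np.where(mask, np.exp(ETA * lj), 0.0)
    denom = p_eta.sum(axis=1, keepdims=True)
    log_d = lj[np.arange(M), c] - np.log(denom[:, 0]) / ETA
    return log_d, p_eta / denom

def objective(beta, log_lik, c):
    """Eq. (6) as a function of the unconstrained class parameters beta^1."""
    y = LAM * margins(log_joint(softmax(beta), log_lik), c)[0]
    h = np.where(y <= 1 - 2 * KAPPA, y + KAPPA,
                 np.where(y < 1, 1 - (y - 1) ** 2 / (4 * KAPPA), 1.0))
    return float(h.sum())

def grad_case_a(beta, log_lik, c):
    """Case A gradient of the objective with respect to beta^1_i."""
    theta1 = softmax(beta)
    log_d, V = margins(log_joint(theta1, log_lik), c)
    y = LAM * log_d
    s = np.where(y <= 1 - 2 * KAPPA, LAM,
                 np.where(y < 1, -LAM / (2 * KAPPA) * (y - 1), 0.0))   # Eq. (8)
    u = np.eye(len(beta))[c]                      # u^{1,m}_i
    inner = s[:, None] * (u - V)                  # s * [u - 1{i != c^m} V]
    return inner.sum(axis=0) - theta1 * inner.sum()

rng = np.random.default_rng(0)
M, C = 20, 3
log_lik = np.log(rng.uniform(0.05, 1.0, size=(M, C)))  # hypothetical attribute terms
c = rng.integers(0, C, size=M)
beta = rng.normal(size=C)

analytic = grad_case_a(beta, log_lik, c)
eps = 1e-6
numeric = np.array([(objective(beta + eps * e, log_lik, c)
                     - objective(beta - eps * e, log_lik, c)) / (2 * eps)
                    for e in np.eye(C)])
```

A check of this kind is a cheap safeguard before handing the objective and its gradient to a CG routine; the Case B gradient can be verified in the same way.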
4 STRUCTURE LEARNING
This section provides three structure learning heuristics
– one generative and two discriminative ones – used in
the experiments in Section 5. Note that the parameters
during structure learning are optimized generatively
using maximum likelihood estimation [19].
4.1 Generative Structure Learning
The conditional mutual information (CMI) [31] between the attributes given the class variable is computed as:

\[
I(X_i; X_j \mid C) = E_{P(X_i, X_j, C)} \left[ \log \frac{P(X_i, X_j \mid C)}{P(X_i \mid C) P(X_j \mid C)} \right],
\]

where $E_{P(X)}[f(X)]$ denotes the expectation of $f(X)$ with respect to $P(X)$. It measures the information between $X_i$ and $X_j$ in the context of $C$. In [21], an algorithm for constructing TAN networks using this measure is provided. We briefly review this algorithm in the following:
1) Compute the pairwise CMI $I(X_i; X_j \mid C)$ for all $1 \leq i \leq N$ and $i < j \leq N$.
2) Build an undirected 1-tree using the maximal weighted spanning tree algorithm [19] where each edge connecting $X_i$ and $X_j$ is weighted by $I(X_i; X_j \mid C)$.
3) Transform the undirected 1-tree into a directed tree. That is, select a root variable and direct all edges away from this root. Add to this tree the class node $C$ and the edges from $C$ to all attributes $X_1, \ldots, X_N$.
This generative structure learning method is abbreviated as CMI in the experiments.
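The three steps above can be sketched as follows. The empirical CMI is computed from relative frequencies, and the spanning tree is grown with Prim's algorithm starting at the root, which is one possible choice of maximal weighted spanning tree algorithm and directs the edges away from the root as a by-product. The function names, the choice of root, and the toy data are assumptions of this example.

```python
import numpy as np

def cmi(xi, xj, c):
    """Empirical conditional mutual information I(X_i; X_j | C) in nats."""
    total = 0.0
    for cv in np.unique(c):
        sel = c == cv
        pc = sel.mean()
        for a in np.unique(xi[sel]):
            for b in np.unique(xj[sel]):
                p_ab = np.mean((xi[sel] == a) & (xj[sel] == b))
                if p_ab > 0:
                    p_a = np.mean(xi[sel] == a)
                    p_b = np.mean(xj[sel] == b)
                    total += pc * p_ab * np.log(p_ab / (p_a * p_b))
    return total

def tan_cmi(X, c, root=0):
    """Steps 1-3: returns parent[j] = attribute parent of X_j (None for the root).

    Every attribute additionally has the class C as a parent.
    """
    n = X.shape[1]
    w = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            w[i, j] = w[j, i] = cmi(X[:, i], X[:, j], c)
    # Prim's algorithm for a maximum weighted spanning tree, grown from the root,
    # so every edge is already directed away from the root.
    parent = [None] * n
    in_tree = [root]
    while len(in_tree) < n:
        best = max(((w[i, j], i, j) for i in in_tree
                    for j in range(n) if j not in in_tree), key=lambda t: t[0])
        parent[best[2]] = best[1]
        in_tree.append(best[2])
    return parent

# Toy data: X_1 copies X_0, while X_2 is independent of both given the class.
c = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X = np.array([[0, 0, 0], [1, 1, 0], [0, 0, 1], [1, 1, 1]] * 2)
parent = tan_cmi(X, c)
```

In the toy data the only informative edge is the one between the first two attributes, so it is added first; the remaining attribute is attached with zero weight to complete the tree.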
4.2 Greedy Discriminative Structure Learning

This method proceeds as follows: a network is initialized to NB and at each iteration an edge is added that gives the largest improvement of the scoring function, while maintaining a partial 1-tree. Basically, two scoring functions have been considered: the classification rate (CR) [32], [33]

\[
CR(\mathcal{B}|\mathcal{S}_V) = \frac{1}{M_V} \sum_{m=1}^{M_V} \mathbf{1}_{\{c^m = \arg\max_{c'} P_{\Theta}(c' \mid \mathbf{x}^m_{1:N})\}}
\]

and the CL [34]

\[
CL(\mathcal{B}|\mathcal{S}_V) = \prod_{m=1}^{M_V} P_{\Theta}(c^m \mid \mathbf{x}^m_{1:N}),
\]

where $\mathcal{S}_V = \{(c^m, \mathbf{x}^m_{1:N})\}_{m=1}^{M_V}$ is the validation data and $M_V = |\mathcal{S}_V|$.

The process of adding edges is terminated when there is no edge which further improves the score. Thus, it might result in a partial 1-tree (forest) over the attributes. This approach is computationally expensive since each time an edge is added, the scores for all $O(N^2)$ edges need to be re-evaluated due to the discriminative non-decomposable scoring functions we employ. Overall, for learning a k-tree structure, $O(N^{k+2})$ score evaluations are necessary. In our experiments, we consider the CR score which is directly related to the empirical risk in [2]. The CR is the discriminative criterion that, given sufficient training data, most directly evaluates the objective (small error rate), while an alternative would be to use a convex upper-bound on the 0/1-loss function [35]. Since we are optimizing over a constrained model space (k-trees) regularization is implicit. The CR evaluation can be accelerated by techniques presented in [20], [36]. In the experiments this greedy heuristic is labeled as TAN-CR for 1-tree structures.

Recently, the maximum margin score was introduced for discriminatively optimizing the structure of Bayesian network classifiers [36]. As a search heuristic simulated annealing was used, which offers mechanisms to escape from locally optimal solutions. The maximum margin optimized Bayesian network structures achieve good classification performance.
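The greedy search loop can be sketched independently of the scoring function. In the sketch below, `score` is a caller-supplied (hypothetical) function that would, for instance, train the parameters by ML and return the classification rate on validation data; here it is replaced by a toy table of edge gains so that the loop can be followed by hand. The function names and the representation of a structure as a list of attribute parents are assumptions of this example.

```python
def creates_cycle(parent, par, child):
    """True if adding the edge par -> child closes a directed cycle."""
    node = par
    while node is not None:
        if node == child:
            return True
        node = parent[node]
    return False

def greedy_tan(n_attrs, score):
    """Greedy edge addition starting from NB (no attribute parents).

    score(parent) is a caller-supplied function (hypothetical here), e.g. the
    classification rate on validation data after ML parameter learning.
    """
    parent = [None] * n_attrs
    best = score(parent)
    while True:
        best_trial = None
        for child in range(n_attrs):
            if parent[child] is not None:      # at most one attribute parent
                continue
            for par in range(n_attrs):
                if par == child or creates_cycle(parent, par, child):
                    continue
                trial = parent.copy()
                trial[child] = par
                s = score(trial)
                if s > best:
                    best, best_trial = s, trial
        if best_trial is None:                 # no edge improves the score
            return parent, best
        parent = best_trial

weights = {(2, 0): 5, (0, 1): 3, (1, 2): 2}    # toy edge gains, keyed (parent, child)
toy_score = lambda parent: sum(weights.get((p, ch), 0)
                               for ch, p in enumerate(parent) if p is not None)
structure, final_score = greedy_tan(3, toy_score)
```

Every pass of the outer loop re-scores all admissible edges, which is the source of the cost discussed above; in the toy run the third edge is rejected because it would close a cycle.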
4.3 Order-based Discriminative Structure Learning

In [37], [20], an order-based greedy algorithm (OMI-CR) has been introduced which is able to find a discriminative TAN structure with only $O(N^2)$ score evaluations. The order-based algorithm consists of 2 steps:
1) Establishing an ordering: First, a total ordering $\prec$ of the variables $\mathbf{X}_{1:N}$ according to the CMI is established. The feature that is most informative about $C$ is selected first. The next node in the order is the node that is most informative about $C$ conditioned on the first node. More specifically, this step determines an ordered sequence of nodes $\mathbf{X}^{1:N}_{\prec} = \{X^1_{\prec}, X^2_{\prec}, \ldots, X^N_{\prec}\}$ according to

\[
X^j_{\prec} \leftarrow \arg\max_{X \in \mathbf{X}_{1:N} \setminus \mathbf{X}^{1:j-1}_{\prec}} \left[ I\left( C; X \mid \mathbf{X}^{1:j-1}_{\prec} \right) \right],
\]

where $j \in \{1, \ldots, N\}$.
2) Selecting parents with respect to a given order to form a k-tree: Once the variables are ordered as $\mathbf{X}^{1:N}_{\prec}$, the parent $X_{\Pi_j}$ for each $X^j_{\prec}$ ($j \in \{3, \ldots, N\}$) is selected from the candidate set $\mathbf{X}_{\Pi_j} = \mathbf{X}^{1:j-1}_{\prec}$. In case of a small size of $\mathbf{X}_{\Pi_j}$ (i.e. $N$) and of k, a computationally costly scoring function to find $X_{\Pi_j}$ can be used. Basically, either the CL or the CR can be used as cost function to select the parents for learning a discriminative structure. We restrict our experiments to CR for parent selection (empirical results showed a better performance). The parameters are trained using ML learning. A parent is connected to $X^j_{\prec}$ only when CR is improved. Otherwise $X^j_{\prec}$ is left parentless (except $C$). This might result in a partial 1-tree (forest) over the attributes.
The classification results of the order-based greedy algorithm are not statistically significantly different from those of the greedy algorithm. Similarly, the SuperParent algorithm [32] is almost as efficient as OMI-CR while achieving slightly lower classification performance [20].
5 EXPERIMENTS

We present results for frame-based phonetic classification using the TIMIT speech corpus [17], for handwritten digit recognition using the MNIST [18] and the USPS data, and for a remote sensing application. In the following, we list the structure learning algorithms used for TAN networks:
• TAN-CMI: Generative TAN structure learning using conditional mutual information (CMI).
• TAN-CR: Discriminative TAN structure learning using the naive greedy heuristic.
• TAN-OMI-CR: Discriminative TAN structure learning using the efficient order-based heuristic.
Once the structure has been determined, discriminative parameter learning is performed. This is either done using the proposed CG algorithm to maximize the margin, labeled as CG-MM (see Section 3), or the CL method (see Section 2.2). Additionally, we show results for margin-based Bayesian network optimization using convex relaxation, denoted as CVX-MM, and provide the computational costs for both algorithms.
The parameters are initialized to the ML estimates for all discriminative parameter learning methods.⁴ Similar to [10], we use cross tuning to estimate the optimal number of iterations for the CG algorithm to avoid overfitting. Additionally, the values of λ ∈ [0.001, ..., 0.5] and κ ∈ [0.01, ..., 0.5] resulting in the best classification rate are obtained empirically using cross tuning. We note that instead of early stopping also regularization of the parameters can be used to avoid overtraining of the models. In [5], concave priors have been suggested; however, ℓ1 or ℓ2 regularization in the unconstrained space of $\beta^j_{i|h}$ is an alternative. In any case, a weight measuring the trade-off between objective function and regularization term has to be determined by cross-validation, so regularization does not reduce the tuning effort. Empirically, we could not observe any advantage of regularization over early stopping in terms of achieved classification performance.

4. Empirical results showed that the initialization of the Bayesian network to the ML estimates for MM or CL optimization performs better than pure random initialization.
Continuous features were discretized using recursive minimal entropy partitioning [38] where the quantization intervals were determined using only the training data. Zero probabilities in the conditional probability tables are replaced with small values ε. Further, we used the same data set partitioning for various learning algorithms.
5.1 Handwritten Digit Recognition and Phonetic Classification
5.1.1 Data Characteristics
In the following, we provide details about the data sets used:
TIMIT-4/6 Data: This data set is extracted from the TIMIT speech corpus using the dialect speaking region 4 which consists of 320 utterances from 16 male and 16 female speakers. Speech frames are classified into either four or six classes using 110134 and 121629 samples, respectively. Each sample is represented by 20 mel-frequency cepstral coefficients (MFCCs) and wavelet-based features. We perform classification experiments on data of male speakers (Ma), female speakers (Fe), and both genders (Ma+Fe), all in all resulting in 6 distinct data sets (i.e. Ma, Fe, Ma+Fe × 4 and 6 classes). The data have been split into 2 mutually exclusive subsets where 70% is used for training and 30% for testing. More details about the features can be found in [39].
MNIST Data: We present results for the handwritten
digit MNIST data [18] which contains 60000 samples for
training and 10000 digits for testing. We downsample
the graylevel images by a factor of two which results in
a resolution of 14 × 14 pixels, i.e. 196 features.
USPS Data: This data set contains 11000 uniformly
distributed handwritten digit images from zip codes of
mail envelopes. The data set is split into 8000 images for
training and 3000 for testing. Each digit is represented
as a 16 × 16 grayscale image, where again each pixel is
considered as feature.
5.1.2 Results
Tables 1, 2, and 3 show the classification rates for MNIST, USPS, and the six TIMIT4/6 data sets for various learning methods.⁵ Additionally, we provide classification performances for SVMs using a radial basis function (RBF) kernel.⁶ For TIMIT4/6 we only show results for the NB structure. The reason is that the final step of MFCC feature extraction involves a discrete cosine transform, i.e. the features are decorrelated. Hence, we empirically observed that the independence assumption of the NB structure is a good choice for these data sets.
TABLE 1
Classification results in [%] for MNIST data with standard deviation. Best parameter learning results for each structure are emphasized using bold font.

| Classifier | ML | CGMM | CGCL |
|---|---|---|---|
| NB | 83.73±0.37 | **91.82±0.27** | 91.70±0.28 |
| TANCMI | 91.28±0.28 | **94.70±0.22** | 93.80±0.24 |
| TANOMICR | 92.01±0.27 | **94.94±0.22** | 93.39±0.25 |
| TANCR | 92.58±0.26 | **95.12±0.22** | 93.94±0.24 |

SVM (C∗ = 1, σ = 0.01): 96.40±0.19
TABLE 2
Classification results in [%] for USPS data with standard deviation. Best parameter learning results for each structure are emphasized using bold font.

| Classifier | ML | CGMM | CGCL |
|---|---|---|---|
| NB | 87.10±0.61 | **95.23±0.39** | 93.67±0.44 |
| TANCMI | 91.90±0.50 | **95.23±0.39** | 94.87±0.40 |
| TANOMICR | 92.40±0.48 | **95.70±0.37** | 94.90±0.40 |
| TANCR | 92.57±0.48 | **96.30±0.34** | 95.83±0.36 |

SVM (C∗ = 1, σ = 0.005): 97.86±0.26
TABLE 3
Classification results in [%] for TIMIT4/6 data with standard deviation. Best results for each data set are emphasized using bold font.

| Data | NB, ML | NB, CGMM | NB, CGCL | SVM (C∗ = 1, σ = 0.05) |
|---|---|---|---|---|
| Ma+Fe4 | 87.90±0.15 | 92.09±0.15 | 92.12±0.16 | **92.49±0.14** |
| Ma4 | 88.69±0.25 | 92.97±0.20 | 92.81±0.20 | **93.30±0.20** |
| Fe4 | 87.67±0.25 | 91.57±0.21 | 91.57±0.22 | **92.14±0.21** |
| Ma+Fe6 | 81.82±0.20 | 85.43±0.18 | 85.41±0.18 | **86.24±0.18** |
| Ma6 | 82.26±0.28 | 86.20±0.26 | 86.28±0.26 | **87.19±0.25** |
| Fe6 | 81.93±0.28 | 84.85±0.26 | 85.12±0.26 | **86.19±0.25** |
| Average | 84.85 | 88.67 | 88.69 | 89.38 |
The classification rate improves for more complex structures using ML parameter learning. Discriminatively optimized structures, i.e. TANOMICR and TANCR, significantly outperform generatively learned structures, i.e. TANCMI and NB. Discriminative parameter learning produces a significantly better classification performance than ML parameter learning on the same classifier structure. This holds especially for cases where the structure of the underlying model is not optimized for classification [10], i.e. NB and TANCMI.

5. The average CR over the six TIMIT4/6 data sets is determined by weighting the CR of each data set with the number of samples in the test set. These values are accumulated and normalized by the total number of samples in all test sets.
6. The SVM uses two parameters C∗ and σ, where C∗ is the penalty parameter for the errors in the nonseparable case and σ is the variance parameter for the RBF kernel.
MM parameter optimization outperforms CL learning for most data sets. However, SVMs outperform our discriminative Bayesian network classifiers on all data sets. For TIMIT4/6 one reason might be that SVMs are applied to the continuous feature domain. In Table 4 we compare the model complexity, i.e. the number of parameters, between SVMs and the best performing Bayesian network classifier. This table reveals that the Bayesian network uses ∼108, ∼66, ∼212, and ∼259 times fewer parameters than the SVM for MNIST, USPS, Ma+Fe4, and Ma+Fe6, respectively. It is a well-known fact that the number of support vectors in classical SVMs increases linearly with the number of training samples [8]. In contrast, the structure of Bayesian network classifiers naturally limits the number of parameters. A substantial difference is that SVMs determine the number of support vectors automatically, while in the case of Bayesian networks the number of parameters is given by the cardinality of the variables and the structure. In this way, the model complexity can be easily controlled by constraints on the structure. We use cross tuning to select C∗ and σ for SVMs, and the parameters λ and κ and the number of CG iterations for MM learning of Bayesian networks.
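The reduction factors quoted above can be reproduced from the entries of Table 4. The following sketch copies those entries and relies on the observation that the listed SVM parameter counts equal the number of support vectors times the number of features N. The names `TABLE4`, `svm_parameter_count`, and `reduction_factor` are illustrative only.

```python
# (number of features N, number of support vectors, number of BN parameters), from Table 4.
TABLE4 = {
    "MNIST": (196, 17201, 31149),
    "USPS": (256, 3837, 14689),
    "TIMIT4/6 (Ma+Fe4)": (20, 13146, 1239),
    "TIMIT4/6 (Ma+Fe6)": (20, 24350, 1877),
    "Washington D.C. Mall": (191, 11934, 62566),
}

def svm_parameter_count(n_features, n_support_vectors):
    """Each support vector stores one value per feature."""
    return n_features * n_support_vectors

def reduction_factor(n_features, n_support_vectors, n_bn_parameters):
    """How many times fewer parameters the Bayesian network uses."""
    return svm_parameter_count(n_features, n_support_vectors) / n_bn_parameters

for name, row in TABLE4.items():
    print(f"{name}: {reduction_factor(*row):.1f} times fewer parameters")
```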
In contrast to SVMs, the Bayesian network structures used here are probabilistic generative models, even when discriminatively learned. They might be preferred since it is easy to work with missing features, domain knowledge can be directly incorporated into the graph structure, and it is easy to work with structured data. In this paragraph, we demonstrate that a discriminatively optimized generative model still offers its advantages in the missing feature case. Our MM parameter learning keeps the sum-to-one constraint of the probability distributions. Therefore, we suggest, similarly to the generatively optimized models, to sum over the missing feature values. The interpretation of marginalizing over missing features is delicate since the discriminatively optimized parameters might not have anything in common with consistently estimated probabilities (such as maximum likelihood estimates). However, at least empirically there is strong support for using the density $P(C, X') = \sum_{X_{1:N} \setminus X'} P(C, X_{1:N})$, where $X'$ is a subset of the features $X_{1:N}$. This computation is tractable if the complexity class of $P(C, X_{1:N})$ is limited (e.g. 1-tree) and the variable order in the summation is chosen appropriately. In contrast, classical discriminative models are inherently conditional and it is not possible to obtain $p(C|X')$ from $p(C|X_{1:N})$. In particular, this holds for SVMs, logistic regression, and multi-layered perceptrons. These models commonly require imputation techniques to first complete missing feature values in the data. Then the classification approach is applied to the completed data.
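For the NB structure, the summation over missing feature values reduces to skipping the corresponding factors, because each conditional distribution sums to one. The sketch below checks this against an explicit sum over all completions. The function names and the array layout (`prior[c]`, `cpts[j][c, v]`) are assumptions made for illustration, not the authors' implementation.

```python
import itertools
import numpy as np

def nb_joint(prior, cpts, x):
    """P(c, x) for all classes c; prior[c] = P(c), cpts[j][c, v] = P(X_j = v | c)."""
    p = prior.copy()
    for j, v in enumerate(x):
        p = p * cpts[j][:, v]
    return p

def nb_joint_observed(prior, cpts, x):
    """P(c, x') where entries of x equal to None are missing; their factors are skipped."""
    p = prior.copy()
    for j, v in enumerate(x):
        if v is not None:
            p = p * cpts[j][:, v]
    return p

def nb_joint_bruteforce(prior, cpts, x):
    """Same quantity, obtained by summing P(c, x) over all completions of the missing features."""
    missing = [j for j, v in enumerate(x) if v is None]
    total = np.zeros_like(prior)
    for values in itertools.product(*(range(cpts[j].shape[1]) for j in missing)):
        full = list(x)
        for j, v in zip(missing, values):
            full[j] = v
        total += nb_joint(prior, cpts, full)
    return total
```

The equality only holds when the sum-to-one constraints are satisfied, which is why keeping them during MM learning matters here.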
We are particularly interested in the case where arbitrary sets of missing features for each classification sample can occur during testing.⁷ In such a case, it is not possible to retrain the model for each potential set of missing features without also memorizing the training set. In Figure 3(a), we present the classification performance of discriminative and generative structures using ML parameter learning on the MNIST data assuming missing features. The x-axis denotes the number of missing features. The curves are the average over 100 classifications of the test data with missing features selected uniformly at random. We use exactly the same missing features for each classifier. We observe that discriminatively structured Bayesian network classifiers outperform TANCMIML even in the case of missing features. This demonstrates, at least empirically, that discriminatively structured generative models do not lose their ability to impute missing features.
In Figure 3(b), we show for the same data set and experimental setup that the classification performance of a discriminatively parameterized NB classifier may be superior to a generatively parameterized NB model in the case of missing features. In particular, this advantage holds for up to ∼80 missing features. For a larger number of missing features, NBML is more robust. Additionally, NBCGMM seems to be more robust to an increasing number of missing features than NBCGCL. This can be attributed to the better generalization property of a margin-optimized classifier.
5.2 Remote Sensing
We use a hyperspectral remote sensing image of the Washington D.C. Mall area containing 191 spectral bands having a spectral width of 5-10 nm.⁸ As ground reference, a classification performed at Purdue University was used containing 7 classes, namely roof, road, grass, trees, trail, water, and shadow.⁹ The aerial image using bands 63, 52, and 36 for red, green, and blue colors, respectively, and the reference image are shown in Figure 4(a) and (b). The image contains 1280 × 307 hyperspectral pixels, i.e. 392960 samples. We arbitrarily choose 5000 samples of each class to learn the classifier. This remote sensing application is of particular interest for our classifiers since spectral bands might be missing or should be neglected due to atmospheric effects. For example, radiation within the visible range should be neglected in case of clouds or darkness.
7. Note that we do not consider missing features during training of
the classifiers.
8. http://cobweb.ecn.purdue.edu/~biehl/MultiSpec/hyperspectral.html
9. http://cobweb.ecn.purdue.edu/~landgreb/Hyperspectral.Ex.html
TABLE 4
Model complexity for best Bayesian network (BN) and SVM.

| Data | N | Number of SVs | Number of SVM parameters | Number of BN parameters |
|---|---|---|---|---|
| MNIST | 196 | 17201 | 3371396 | 31149 |
| USPS | 256 | 3837 | 982272 | 14689 |
| TIMIT4/6 (Ma+Fe4) | 20 | 13146 | 262920 | 1239 |
| TIMIT4/6 (Ma+Fe6) | 20 | 24350 | 487000 | 1877 |
| Washington D.C. Mall | 191 | 11934 | 2279394 | 62566 |
Fig. 3. Classification performance on MNIST assuming missing features. The x-axis denotes the number of missing features and the shaded regions correspond to the standard deviation over 100 classifications: (a) Different structure learning methods with generative parameterization; (b) Different discriminative parameter learning methods on NB structure.
We apply the generative and discriminative parameter learning algorithms introduced above to the NB network structure. The classification performances are shown in Table 5.
Remarkably, NBCGMM slightly outperforms SVMs in this experiment. Additionally, the Bayesian network employs ∼36 times fewer parameters than the SVM (see Table 4). Figure 5 shows the influence of parameter κ in the loss function (6) for λ = 0.02 on the classification performance. The classification rate slightly improves for κ = 0.5. However, the impact is moderate. Note that the selection of κ is based on the cross-validation performance on the training data.

Fig. 4. Washington D.C. Mall: (a) Pseudo color image of spectral bands 63, 52, and 36; (b) Reference image.

TABLE 5
Classification results in [%] for Washington D.C. Mall data with standard deviation. Best parameter learning result is emphasized using bold font.

| Classifier | ML | CGMM | CGCL |
|---|---|---|---|
| NB | 81.07±0.06 | **89.34±0.05** | 87.01±0.05 |

SVM (C∗ = 1, σ = 0.05): 88.98±0.05
Similarly to MNIST in Section 5.1, we report classification results for NBML, NBCGMM, and NBCGCL assuming features missing at random during classification in Figure 6. The x-axis denotes the number of missing features. We average the performances over 100 classifications of the test data with randomly missing features. The standard deviation indicates that the resulting differences are significant for a moderate number of missing features. Discriminatively parameterized NB classifiers outperform NBML in the case of up to 150 missing features. Furthermore, we present results for SVMs where imputation methods are first used to complete missing feature values in the data. Afterwards, SVMs are applied to the completed data. In particular, we use two imputation approaches: (i) mean value imputation, where the missing value is replaced with the mean value of the feature in the training data set; (ii) kNN value imputation, where the missing value is replaced with the mean value (for discretized data the most frequent value) of the k-nearest neighbors. The neighbors of a sample with missing features are determined by the Euclidean distance in the relevant subspace. In the special case where k equals the number of training instances M, this method is identical to mean value imputation. We use k = 5. As shown in Figure 6, mean value imputation significantly degrades the classification performance of SVMs in case of missing features. Handling missing features with NB classifiers is easy since we can simply neglect the conditional probability of the missing feature $Z_j$ in Eq. (1), i.e. the joint probability is the product of the available features only.

Fig. 5. Influence of parameter κ of the loss function on the classification rate for λ = 0.02 (x-axis: κ; y-axis: classification performance [%]).
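The two imputation approaches for continuous features can be sketched as follows. This is a minimal illustration assuming that missing entries are encoded as NaN and that the "relevant subspace" consists of the observed dimensions of the test sample; the function names are hypothetical.

```python
import numpy as np

def mean_impute(x, train):
    """Replace NaN entries of x with the per-feature mean of the training data."""
    filled = x.copy()
    missing = np.isnan(x)
    filled[missing] = train.mean(axis=0)[missing]
    return filled

def knn_impute(x, train, k=5):
    """Replace NaN entries of x with the mean of the k nearest training samples.

    Neighbors are found by Euclidean distance over the observed dimensions of x only.
    """
    observed = ~np.isnan(x)
    dist = np.linalg.norm(train[:, observed] - x[observed], axis=1)
    nearest = np.argsort(dist)[:k]
    filled = x.copy()
    filled[~observed] = train[nearest][:, ~observed].mean(axis=0)
    return filled
```

With `k` set to the number of training samples, `knn_impute` returns the same result as `mean_impute`, matching the special case mentioned above. Note that `knn_impute` needs access to the training data at classification time.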
Fig. 6. Washington D.C. Mall: Classification results for
NBML, NBCGMM, NBCGCL, and SVMs (using mean
value imputation) assuming missing features.
Figure 7 shows kNN value imputation results for SVMs and NBCGMM. kNN feature value imputation is slow and requires the training data during classification of samples with features missing at random. However, it provides more information to the classifier compared to simple summation over the missing feature values, as shown for the NBCGMM case.
Fig. 7. Washington D.C. Mall: Classification results for
NBCGMM (summation over missing feature values),
NBCGMM (using kNN most frequent value imputation),
and SVMs (using kNN mean value imputation) assuming
missing features.
5.3 Margin Optimization using Convex Relaxation
In this section, we compare our CG-based margin optimization to a recently proposed approach using convex relaxation [1] in terms of classification accuracy and computational efficiency. First we provide a short introduction to convex relaxation for margin maximization and give details on solving the convex problem for our data. Unfortunately, Guo et al. [1] only provided results on small-scale experiments, i.e. 50 samples and up to 36 features.
5.3.1 Background
Guo et al. [1] proposed to solve the maximum margin parameter learning problem for Bayesian network classifiers by reformulating it as a convex optimization problem. They introduced the parameter vector $w$ with elements $w^j_{i|h} = \log(\theta^j_{i|h})$ (in some order) and, using the same order for the elements, the feature vectors $\phi(z^m)$ with elements $u^{j,m}_{i|h}$. Then, the probability of sample $z^m$ can be written as $P_\Theta(Z = z^m) = \exp(\phi(z^m)^T w)$, where $\phi(z^m)^T$ denotes the transpose of $\phi(z^m)$. The logarithm of the multiclass margin (3) of the mth sample becomes

$$\log d^m_\Theta = \min_{c \neq c^m} \left[\phi(c^m, x^m_{1:N}) - \phi(c, x^m_{1:N})\right]^T w.$$

In this way, the problem of learning the maximum margin parameters of the Bayesian network can be recast as

$$\begin{aligned}
\underset{\gamma, w}{\text{maximize}} \quad & \gamma \\
\text{s.t.} \quad & \Delta_{m,c}\, w \geq \gamma, && \forall m \text{ and } c \neq c^m, \\
& \gamma \geq 0, \\
& \sum_{i=1}^{|Z_j|} \exp(w^j_{i|h}) = 1, && \forall j, h,
\end{aligned}$$

where $\gamma$ is the logarithm of the minimum of all sample margins and $\Delta_{m,c} = [\phi(c^m, x^m_{1:N}) - \phi(c, x^m_{1:N})]^T$. The first constraint ensures that all sample margins are greater than $\gamma$ and the third constraint that $w$ parameterizes a valid Bayesian network, i.e. $w$ describes valid probability distributions.
Finally, by introducing one slack variable $\epsilon_m$ for each sample $z^m$, relaxing the constraints on the parameter vector $w$ and rewriting the objective function, Guo et al. derived the optimization problem

$$\begin{aligned}
\underset{\gamma, w, \epsilon_1, \ldots, \epsilon_M}{\text{minimize}} \quad & \frac{1}{2\gamma^2} + B \sum_{m=1}^{M} \epsilon_m && (9) \\
\text{s.t.} \quad & \Delta_{m,c}\, w \geq \gamma - \epsilon_m, && \forall m \text{ and } c \neq c^m, \\
& \gamma \geq 0, \\
& \epsilon_m \geq 0, && \forall m, \\
& \sum_{i=1}^{|Z_j|} \exp(w^j_{i|h}) \leq 1, && \forall j, h,
\end{aligned}$$

for determining the maximum margin parameters. The parameter $B$ can be used to control the slack effect (similar to parameter C in SVMs). The above problem is convex with convex inequality constraints. Hence, any local minimum is also a global minimum. Furthermore, under certain conditions on the structure of the Bayesian network, the (typically) subnormalized parameter vector $w$ of a solution allows for renormalization without changing the decision function $P(c|x_{1:N})$ (see [4], [5] for details).
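The log-linear form of the joint probability and of the log margin can be checked numerically for an NB structure. In the sketch below, the element order of $w$ and $\phi$ (class prior entries first, then one block per feature), the assumption that all features share the same number of values, and all function names are illustrative choices, not taken from [1].

```python
import numpy as np

def phi(c, x, n_classes, n_values):
    """Indicator feature vector of (c, x) for a naive Bayes structure."""
    vec = np.zeros(n_classes + len(x) * n_classes * n_values)
    vec[c] = 1.0                                   # selects log P(c)
    for j, v in enumerate(x):
        # selects log P(X_j = v | c) inside the block of feature j
        vec[n_classes + (j * n_classes + c) * n_values + v] = 1.0
    return vec

def pack_w(prior, cpts):
    """w = log(theta) in the same order as phi; cpts[j][c, v] = P(X_j = v | c)."""
    return np.concatenate([np.log(prior)] + [np.log(t).ravel() for t in cpts])

def log_margin(c_true, x, w, n_classes, n_values):
    """min over c != c_true of [phi(c_true, x) - phi(c, x)]^T w."""
    ref = phi(c_true, x, n_classes, n_values)
    return min((ref - phi(c, x, n_classes, n_values)) @ w
               for c in range(n_classes) if c != c_true)
```

A positive log margin means that the sample is classified correctly; each margin constraint in the problems above is linear in $w$.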
There are many possibilities to solve the optimization problem in Eq. (9). Any minimization method allowing for a nonlinear objective function and nonlinear convex inequality constraints can be used in principle. We decided to use the large-scale solver IPOPT [40], which shows good performance in several applications (see e.g. [41]).¹⁰ IPOPT applies an interior-point method [43] to solve the problem in (9). It requires the objective function, its gradient, the constraint functions, the Jacobian of the constraint functions, and the second derivative of the Lagrangian function. To ensure short runtimes and good results we used an adaptive strategy for the barrier parameter, and let the algorithm run for up to 100 iterations or until sufficient precision was achieved.¹¹ We refer to solutions obtained by IPOPT as CVXMM.

10. We used IPOPT 3.9.2 in conjunction with MUMPS 4.9.2 [42], a parallel sparse direct solver. IPOPT was compiled with Lapack 3.2.1 and BLAS from the Netlib repository (version from March 2007). IPOPT is typically faster than the function fmincon of MATLAB for this type of optimization problem.
11. Good classifiers do not require highly accurate solutions. Hence, the tolerance for the objective function is set to $10^{-1}$, reducing the runtime of the algorithm compared to using default tolerance settings.

5.3.2 Experimental Comparison
Table 6 and Table 7 show the classification rates and runtimes for the different algorithms and data sets, respectively. The classification rates of CVXMM are slightly better than those of CGMM, while the proposed algorithm CGMM¹² is up to orders of magnitude faster.
For USPS the training data is separable with a large margin by an NB classifier, i.e. there exists a probability distribution that factors according to an NB network for which all samples in the training set are classified correctly and for which samples from different classes are separated by a large margin. Therefore, an optimal solution of (9) has a small objective for a large range of the parameter B. This complicates the choice of B as well as the tolerance settings for the interior-point optimizer (the optimization problem has to be solved with high precision while keeping the optimization tractable). In our experiments we were not able to find a setting such that the achieved classification rate on the test set is larger than 90.90%, which is much smaller than the classification rate of CGMM.
The large computational requirements of CVXMM are caused by the convex formulation in Eq. (9): there is one inequality for each conditional probability of the network, and for every additional training sample the number of inequalities increases by C, i.e. the number of classes. Further, each additional training sample introduces an additional slack variable, which increases the dimension of the search space. The test sets used were the same as described above. The runtime experiments were performed on a personal computer with 2.8 GHz CPUs and 16 GB of memory, not exploiting any (multicore) parallelization. Furthermore, we fixed the regularization parameter B to 1 for the MNIST data because of time reasons, but selected it using cross tuning for the TIMIT4/6 and USPS data.
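The growth of the relaxed problem with the training set size can be made explicit by counting variables and inequalities of Eq. (9). The sketch below is a rough bookkeeping aid under stated assumptions: it counts one sub-normalization inequality per conditional distribution (one per pair j, h), and the helper `naive_bayes_sizes` assumes an NB structure in which all features take the same number of values. Both function names are hypothetical.

```python
def relaxed_problem_size(n_samples, n_classes, n_parameters, n_distributions):
    """Count variables and inequality constraints of the relaxed problem in Eq. (9)."""
    n_variables = 1 + n_parameters + n_samples   # gamma, w, one slack per sample
    n_inequalities = (
        n_samples * (n_classes - 1)   # margin constraints, one per wrong class
        + n_samples                   # slack non-negativity
        + 1                           # gamma >= 0
        + n_distributions             # sub-normalization constraints
    )
    return n_variables, n_inequalities

def naive_bayes_sizes(n_features, n_values, n_classes):
    """Parameter and distribution counts of an NB network (equal feature cardinalities assumed)."""
    n_parameters = n_classes + n_features * n_classes * n_values
    n_distributions = 1 + n_features * n_classes
    return n_parameters, n_distributions
```

Each additional sample contributes C − 1 margin constraints plus one slack constraint, i.e. C inequalities, and one further optimization variable.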
TABLE 6
Classification rate (CR) in [%] for different data using a naive Bayes classifier.

| Data | ML, CR | CGMM, CR | CVXMM, CR | CVXMM, B |
|---|---|---|---|---|
| MNIST | 83.73 | 91.82 | 92.04 | 1 |
| USPS | 87.10 | 95.23 | 90.90 | 1 |
| Ma+Fe4 | 87.90 | 92.09 | 92.31 | 3.9·10⁻³ |
| Ma4 | 88.69 | 92.97 | 93.09 | 3.9·10⁻³ |
| Fe4 | 87.67 | 91.57 | 91.82 | 3.9·10⁻³ |
| Ma+Fe6 | 81.82 | 85.43 | 85.61 | 3.9·10⁻³ |
| Ma6 | 82.26 | 86.20 | 86.67 | 3.9·10⁻³ |
| Fe6 | 81.93 | 84.85 | 85.46 | 3.9·10⁻³ |
Convex relaxation for margin optimization is interesting due to its sound theoretical background. However, without further algorithmic developments its practical application seems to be limited to problems with few training samples and a low number of features. In contrast, the proposed method for maximum margin parameter learning can deal with large sets of training data efficiently and achieves comparable classification rates. Furthermore, we observed superior runtime performance of the proposed method in the conducted experiments.

12. CGMM is implemented in MATLAB.

TABLE 7
Runtimes in [s] for different data using a naive Bayes classifier (B as in Table 6).

| Data | CGMM | CVXMM |
|---|---|---|
| MNIST | 833 | 54 hours |
| USPS | 113 | 21 hours |
| Ma+Fe4 | 391 | 1338 |
| Ma4 | 168 | 842 |
| Fe4 | 87 | 844 |
| Ma+Fe6 | 202 | 4566 |
| Ma6 | 241 | 3505 |
| Fe6 | 108 | 3002 |
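The runtime advantage can be quantified directly from the entries of Table 7. The following sketch converts the two entries given in hours to seconds and computes the ratio of CVXMM to CGMM runtime; the names `RUNTIMES` and `speedup` are illustrative only.

```python
# (CGMM runtime, CVXMM runtime) in seconds, from Table 7.
RUNTIMES = {
    "MNIST": (833, 54 * 3600),
    "USPS": (113, 21 * 3600),
    "Ma+Fe4": (391, 1338),
    "Ma4": (168, 842),
    "Fe4": (87, 844),
    "Ma+Fe6": (202, 4566),
    "Ma6": (241, 3505),
    "Fe6": (108, 3002),
}

def speedup(cg_seconds, cvx_seconds):
    """Factor by which CGMM is faster than CVXMM."""
    return cvx_seconds / cg_seconds

speedups = {name: speedup(*times) for name, times in RUNTIMES.items()}
for name, factor in speedups.items():
    print(f"{name}: CGMM is about {factor:.1f} times faster")
```

On these numbers the factor ranges from roughly 3 on Ma+Fe4 to several hundred on MNIST and USPS, which is what "up to orders of magnitude" refers to.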
6 CONCLUSION
We derived a discriminative parameter learning algorithm for Bayesian network classifiers based on maximizing the margin. For margin optimization we introduced a conjugate gradient algorithm. In contrast to previous work on margin optimization in probabilistic models, we kept the sum-to-one constraint, which maintains the probabilistic interpretation of the network, e.g. summation over missing variables is still possible. In the experiments, we treated two cases of discriminative parameter learning; both optimization criteria (CL and MM) were optimized with the CG method. Furthermore, we applied various parameter learning algorithms on naive Bayes and generatively and discriminatively optimized TAN structures. Discriminative parameter learning significantly outperforms ML parameter estimation. Furthermore, maximizing the margin slightly improves the classification performance compared to CL parameter optimization in most cases.
Additionally, we provided empirical results for a maximum margin optimization approach based on convex relaxation. The classification results of both maximum margin parameter learning approaches are almost identical, whereas the computational requirements of our CG-based optimization are up to orders of magnitude lower. Margin-optimized Bayesian networks perform on par with SVMs in terms of classification rate. However, the Bayesian network classifiers require fewer parameters than the SVM and can directly deal with missing features, a case where discriminative classifiers usually require imputation techniques.
ACKNOWLEDGMENTS
The authors thank the anonymous reviewers for useful comments that improved the quality of the paper. Thanks to Jeff Bilmes for discussions and support in writing this paper.
APPENDIX: CL PARAMETER LEARNING
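The CG algorithm relies on the gradient of the objective function given as

$$\frac{\partial\, \mathrm{CLL}(B|S)}{\partial \theta^j_{i|h}} = \sum_{m=1}^{M} \left[ \frac{\partial}{\partial \theta^j_{i|h}} \log P_\Theta(c^m, x^m_{1:N}) - \frac{\sum_{c=1}^{C} \frac{\partial}{\partial \theta^j_{i|h}} P_\Theta(c, x^m_{1:N})}{\sum_{c=1}^{C} P_\Theta(c, x^m_{1:N})} \right].$$

Similarly to Section 3.3, we distinguish two cases for differentiating $\mathrm{CLL}(B|S)$, i.e. either $C = i$ for $j = 1$ (Case 1) or $C = h_1$ for $j > 1$ (Case 2).
Case 1: For the class variable, i.e. $j = 1$ and $h = \emptyset$, we get

$$\frac{\partial\, \mathrm{CLL}(B|S)}{\partial \theta^1_i} = \sum_{m=1}^{M} \left[ \frac{u^{1,m}_i}{\theta^1_i} - \frac{W^m_i}{\theta^1_i} \right],$$

using Eq. (1) for differentiating the first term (omitting the sum over $j$ and $h$), and we introduced the class posterior $W^m_i = P_\Theta(i|x^m_{1:N})$ as

$$W^m_i = \frac{P_\Theta(i, x^m_{1:N})}{\sum_{c=1}^{C} P_\Theta(c, x^m_{1:N})}.$$

Case 2: For the attribute variables, i.e. $j > 1$, we differentiate correspondingly and have

$$\frac{\partial\, \mathrm{CLL}(B|S)}{\partial \theta^j_{i|h}} = \sum_{m=1}^{M} \left[ \frac{u^{j,m}_{i|h}}{\theta^j_{i|h}} - W^m_{h_1} \frac{v^{j,m}_{i|h \setminus h_1}}{\theta^j_{i|h}} \right],$$

where $W^m_{h_1} = P_\Theta(h_1|x^m_{1:N})$ is the posterior for class $h_1$ and sample $m$, and $v^{j,m}_{i|h \setminus h_1} = \mathbb{1}_{\{z^m_j = i \text{ and } z^m_{\Pi_j} = h \setminus h_1\}}$.
The conditional log likelihood given in Eq. (2) can be optimized by a conjugate gradient algorithm using line search in a similar manner as given in Section 3.2. Again, we reparameterize the problem to incorporate the constraints on $\theta^j_{i|h}$ in the conjugate gradient algorithm. This requires the gradient of $\mathrm{CLL}(B|S)$ with respect to $\beta^j_{i|h}$, which is computed using the chain rule as

$$\frac{\partial\, \mathrm{CLL}(B|S)}{\partial \beta^1_i} = \sum_{k=1}^{|Z_1|} \frac{\partial\, \mathrm{CLL}(B|S)}{\partial \theta^1_k} \frac{\partial \theta^1_k}{\partial \beta^1_i} = \sum_{m=1}^{M} \left[ u^{1,m}_i - W^m_i \right] - \theta^1_i \sum_{m=1}^{M} \sum_{c=1}^{C} \left[ u^{1,m}_c - W^m_c \right]$$

for Case 1. Similarly for Case 2, the gradient is

$$\frac{\partial\, \mathrm{CLL}(B|S)}{\partial \beta^j_{i|h}} = \sum_{m=1}^{M} \left[ u^{j,m}_{i|h} - W^m_{h_1} v^{j,m}_{i|h \setminus h_1} \right] - \theta^j_{i|h} \sum_{m=1}^{M} \sum_{l=1}^{|Z_j|} \left[ u^{j,m}_{l|h} - W^m_{h_1} v^{j,m}_{l|h \setminus h_1} \right].$$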
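The Case 1 gradient with respect to the unconstrained class parameters can be verified numerically. The sketch below assumes the softmax reparameterization $\theta^1_i = \exp(\beta^1_i) / \sum_k \exp(\beta^1_k)$ and treats the class-conditional log likelihoods `log_lik[m, c]` as fixed, precomputed inputs. The function names and this simplified setup are hypothetical and serve only to compare the analytic expression with finite differences.

```python
import numpy as np

def softmax(b):
    e = np.exp(b - b.max())
    return e / e.sum()

def cll(beta_c, log_lik, labels):
    """Conditional log likelihood; log_lik[m, c] = log P(x^m | c) is held fixed."""
    log_joint = np.log(softmax(beta_c)) + log_lik            # log P(c, x^m)
    log_evidence = np.logaddexp.reduce(log_joint, axis=1)    # log sum_c P(c, x^m)
    return np.sum(log_joint[np.arange(len(labels)), labels] - log_evidence)

def grad_beta_class(beta_c, log_lik, labels):
    """Analytic Case 1 gradient of the CLL with respect to beta^1."""
    theta = softmax(beta_c)
    log_joint = np.log(theta) + log_lik
    # W[m, c] is the class posterior P(c | x^m).
    W = np.exp(log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True))
    U = np.eye(len(beta_c))[labels]        # U[m, c] = 1 if c is the label of sample m
    A = (U - W).sum(axis=0)                # sum over m of [u - W]
    return A - theta * A.sum()             # second term of the chain-rule expression
```

Since both the indicators and the posteriors sum to one over the classes, the second term vanishes for the class variable; it is kept here to mirror the general expression.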
REFERENCES
[1] Y. Guo, D. Wilkinson, and D. Schuurmans, “Maximum margin Bayesian networks,” in International Conference on Uncertainty in Artificial Intelligence (UAI), 2005, pp. 233–242.
[2] V. Vapnik, Statistical learning theory. Wiley & Sons, 1998.
[3] B. Taskar, C. Guestrin, and D. Koller, “Max-margin Markov networks,” in Advances in Neural Information Processing Systems (NIPS), 2003.
[4] H. Wettig, P. Grünwald, T. Roos, P. Myllymäki, and H. Tirri, “When discriminative learning of Bayesian network parameters is easy,” in International Joint Conference on Artificial Intelligence (IJCAI), 2003, pp. 491–496.
[5] T. Roos, H. Wettig, P. Grünwald, P. Myllymäki, and H. Tirri, “On discriminative Bayesian network classifiers and logistic regression,” Machine Learning, vol. 59, pp. 267–296, 2005.
[6] F. Sha and L. Saul, “Comparison of large margin training to other discriminative methods for phonetic recognition by hidden Markov models,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007, pp. 313–316.
[7] G. Heigold, T. Deselaers, R. Schlüter, and H. Ney, “Modified MMI/MPE: A direct evaluation of the margin in speech recognition,” in International Conference on Machine Learning (ICML), 2008, pp. 384–391.
[8] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Trading convexity for scalability,” in International Conference on Machine Learning (ICML), 2006, pp. 201–208.
[9] C. Bishop, Neural networks for pattern recognition. Oxford University Press, 1995.
[10] R. Greiner, X. Su, S. Shen, and W. Zhou, “Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers,” Machine Learning, vol. 59, pp. 297–322, 2005.
[11] O. Gopalakrishnan, D. Kanevsky, A. Nádas, and D. Nahamoo, “An inequality for rational functions with applications to some statistical estimation problems,” IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 107–113, 1991.
[12] F. Pernkopf and M. Wohlmayr, “On discriminative parameter learning of Bayesian network classifiers,” in European Conference on Machine Learning (ECML), 2009, pp. 221–237.
[13] P. Woodland and D. Povey, “Large scale discriminative training of hidden Markov models for speech recognition,” Computer Speech and Language, vol. 16, pp. 25–47, 2002.
[14] R. Schlüter, W. Macherey, M. B., and H. Ney, “Comparison of discriminative training criteria and optimization methods for speech recognition,” Speech Communication, vol. 34, pp. 287–310, 2001.
[15] F. Pernkopf and M. Wohlmayr, “Maximum margin Bayesian network classifiers,” Institute of Signal Processing and Speech Communication, Graz University of Technology, Tech. Rep., 2010.
[16] F. Pernkopf and M. Wohlmayr, “Large margin learning of Bayesian classifiers based on Gaussian mixture models,” in European Conference on Machine Learning (ECML), 2010, pp. 50–66.
[17] L. Lamel, R. Kassel, and S. Seneff, “Speech database development: Design and analysis of the acoustic-phonetic corpus,” in DARPA Speech Recognition Workshop, Report No. SAIC86/1546, 1986.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[19] J. Pearl, Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, 1988.
[20] F. Pernkopf and J. Bilmes, “Efficient heuristics for discriminative structure learning of Bayesian network classifiers,” Journal of Machine Learning Research, vol. 11, pp. 2323–2360, 2010.
[21] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, pp. 131–163, 1997.
[22] P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Machine Learning, vol. 29, no. 2-3, pp. 103–130, 1997.
[23] J. Bilmes, “Dynamic Bayesian multinets,” in 16th Inter. Conf. of Uncertainty in Artificial Intelligence (UAI), 2000, pp. 38–45.
[24] R. Cowell, A. Dawid, S. Lauritzen, and D. Spiegelhalter, Probabilistic networks and expert systems. Springer Verlag, 1999.
[25] C. Bishop, Pattern recognition and machine learning. Springer, 2006.
[26] S. Acid, L. de Campos, and J. Castellano, “Learning Bayesian network classifiers: Searching in a space of partially directed acyclic graphs,” Machine Learning, vol. 59, pp. 213–235, 2005.
[27] B. Schölkopf and A. Smola, Learning with kernels: Support Vector Machines, regularization, optimization, and beyond. MIT Press, 2001.
[28] P. Huber, “Robust estimation of a location parameter,” Annals of Statistics, vol. 53, pp. 73–101, 1964.
[29] O. Chapelle, “Training a support vector machine in the primal,” Neural Computation, vol. 19, no. 5, pp. 1155–1178, 2007.
[30] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical recipes in C. Cambridge Univ. Press, 1992.
[31] T. Cover and J. Thomas, Elements of information theory. John Wiley & Sons, 1991.
[32] E. Keogh and M. Pazzani, “Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches,” in Workshop on Artificial Intelligence and Statistics, 1999, pp. 225–230.
[33] F. Pernkopf, “Bayesian network classifiers versus selective kNN classifier,” Pattern Recognition, vol. 38, no. 3, pp. 1–10, 2005.
[34] D. Grossman and P. Domingos, “Learning Bayesian network classifiers by maximizing conditional likelihood,” in Inter. Conf. of Machine Learning (ICML), 2004, pp. 361–368.
[35] P. Bartlett, M. Jordan, and J. McAuliffe, “Convexity, classification, and risk bounds,” Journal of the American Statistical Association, vol. 101, no. 473, pp. 138–156, 2006.
[36] F. Pernkopf and M. Wohlmayr, “Stochastic margin-based structure learning of Bayesian network classifiers,” Laboratory of Signal Processing and Speech Communication, Graz University of Technology, Tech. Rep., 2011.
[37] F. Pernkopf and J. Bilmes, “Order-based discriminative structure learning for Bayesian network classifiers,” in International Symposium on Artificial Intelligence and Mathematics, 2008.
[38] U. Fayyad and K. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” in Joint Conf. on Artificial Intelligence, 1993, pp. 1022–1027.
[39] F. Pernkopf, T. Van Pham, and J. Bilmes, “Broad phonetic classification using discriminative Bayesian networks,” Speech Communication, vol. 143, no. 1, pp. 123–138, 2008.
[40] A. Wächter and L. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” Mathematical Programming, vol. 106, pp. 25–57, 2006.
[41] L. Biegler and V. Zavala, “Large-scale nonlinear programming using IPOPT: An integrating framework for enterprise-wide dynamic optimization,” Computers & Chemical Engineering, vol. 33, no. 3, pp. 575–582, 2009.
[42] P. Amestoy, I. Duff, J.-Y. L’Excellent, and J. Koster, “MUMPS: A general purpose distributed memory sparse solver,” in 5th International Workshop on Applied Parallel Computing. Springer Verlag, 2000, pp. 122–131.
[43] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, March 2004.
Franz Pernkopf received his MSc (Dipl. Ing.) degree in Electrical Engineering at Graz University of Technology, Austria, in summer 1999. He earned a PhD degree from the University of Leoben, Austria, in 2002. In 2002 he was awarded the Erwin Schrödinger Fellowship. He was a Research Associate in the Department of Electrical Engineering at the University of Washington, Seattle, from 2004 to 2006. Currently, he is Associate Professor at the Laboratory of Signal Processing and Speech Communication, Graz University of Technology, Austria. His research interests include machine learning, discriminative learning, graphical models, feature selection, finite mixture models, and image and speech processing applications.
Michael Wohlmayr graduated from Graz University of Technology in June 2007. He conducted his Master thesis in collaboration with University of Crete, Greece. Since October 2007, he has been pursuing the PhD degree at the Signal Processing and Speech Communication Laboratory at Graz University of Technology. His research interests include Bayesian networks, speech and audio analysis, as well as statistical pattern recognition.
Sebastian Tschiatschek received the BSc degree and MSc degree (with distinction) in Electrical Engineering at Graz University of Technology (TUG) in 2007 and 2010, respectively. He conducted his Master thesis during a one-year stay at ETH Zürich, Switzerland. Currently, he is with the Signal Processing and Speech Communication Laboratory at TUG where he is pursuing the PhD degree. His research interests include Bayesian networks, information theory in conjunction with graphical models and statistical pattern recognition.