
Maximum Margin Bayesian Network Classifiers

Franz Pernkopf, Member, IEEE, Michael Wohlmayr, Student Member, IEEE,

Sebastian Tschiatschek, Student Member, IEEE

Abstract—We present a maximum margin parameter learning algorithm for Bayesian network classifiers using a conjugate gradient

(CG) method for optimization. In contrast to previous approaches, we maintain the normalization constraints of the parameters of

the Bayesian network during optimization, i.e. the probabilistic interpretation of the model is not lost. This makes it possible to handle missing

features in discriminatively optimized Bayesian networks. In experiments, we compare the classification performance of maximum

margin parameter learning to conditional likelihood and maximum likelihood learning approaches. Discriminative parameter learning

significantly outperforms generative maximum likelihood estimation for naive Bayes and tree augmented naive Bayes structures on

all considered data sets. Furthermore, maximizing the margin dominates the conditional likelihood approach in terms of classification

performance in most cases. We provide results for a recently proposed maximum margin optimization approach based on convex

relaxation [1]. While the classification results are highly similar, our CG-based optimization is computationally up to orders of magnitude

faster. Margin-optimized Bayesian network classifiers achieve classification performance comparable to support vector machines

(SVMs) using fewer parameters. Moreover, we show that unanticipated missing feature values during classification can

be easily processed by discriminatively optimized Bayesian network classifiers, a case where discriminative classifiers usually require

mechanisms to complete unknown feature values in the data first.

Index Terms—Bayesian network classifier, discriminative learning, discriminative classifiers, large margin training, missing features,

convex relaxation.

• F. Pernkopf, M. Wohlmayr, and S. Tschiatschek are with the Department of Electrical Engineering, Laboratory of Signal Processing and Speech Communication, Graz University of Technology, Austria. E-mail: pernkopf@tugraz.at, michael.wohlmayr@tugraz.at, tschiatschek@tugraz.at

This work was supported by the Austrian Science Fund (Project number P22488-N23) and (Project number S10610).

1 INTRODUCTION

In statistical learning theory, the PAC bound on the expected risk for unseen data depends on the empirical risk on the training data and a measure for the generalization ability of the empirical model which is directly related to the Vapnik-Chervonenkis (VC) dimension [2]. One of the most successful discriminative classifiers, namely the support vector machine (SVM), finds a decision boundary which maximizes the margin between samples of distinct classes, resulting in good generalization properties of the classifier. In contrast, conventional discriminative training methods that rely on the conditional likelihood (CL) optimize only the empirical risk, which is suboptimal. Taskar et al. [3] observed that undirected graphical models can be efficiently trained to maximize the margin. More recently, Guo et al. [1] introduced the maximization of the margin to Bayesian networks using convex optimization. Unlike in undirected graphical models, the main difficulty for Bayesian networks is maintaining the normalization constraints of the local conditional probabilities during parameter learning. In [1], these constraints are relaxed to obtain a convex optimization problem. However, conditions on the graph structure are given, ensuring that the class posterior of the relaxed problem is unchanged in case of re-normalization [4], [5]. Unfortunately, classification results for this algorithm have only been demonstrated on small-scale experiments. Since then, different margin-based training algorithms have been proposed for hidden Markov models in [6], [7] and references therein.

Compared to [1], we maximize the margin in Bayesian

network classifiers using a different approach. We keep

the sum-to-one constraints which maintain the proba-

bilistic interpretation of the network. This has the par-

ticular advantage that summing over missing variables

is still possible (as we show in this paper). However,

we no longer have a convex optimization problem.

Convex problems are desirable in many cases as any

local optimum is a global optimum. Collobert et al. [8]

show that the optimization of non-convex loss functions

in SVMs can lead to sparse solutions (lower number of

support vectors) and accelerated training performance.

They conclude that the sacrosanct popularity of convex

approaches should not pre-empt the exploration of al-

ternative techniques, since they may offer computational

advantages. Similar observations are reported in [7] and

in this article.

In this paper, we introduce maximum margin (MM)

parameter learning for Bayesian network classifiers us-

ing a conjugate gradient (CG) method [9]. We treat two

cases of discriminative parameter learning: both opti-

mization criteria (CL or MM) are optimized using a CG

algorithm. CG-based CL learning for Bayesian networks

has been introduced in [10]. Recently, we proposed to

use the extended Baum-Welch (EBW) algorithm [11] for

optimizing the CL of Bayesian network classifiers [12].

In the speech community, the EBW algorithm is well-

known for optimizing the CL of hidden Markov mod-

els [11], [13]. EBW offers an EM-like parameter update.


In fact, it is shown in [14] that the EBW algorithm resem-

bles the gradient descent algorithm for discriminatively

optimizing Gaussian mixtures using a particular step

size choice in the gradient descent method. In [15], we

attempted to use EBW for MM parameter optimization

of Bayesian network classifiers. We empirically observed

results similar to those of CG-based optimization; however,

EBW requires a rational objective function, which

can no longer be guaranteed. Similarly, we introduced

maximum margin learning to Gaussian mixture models

using the EBW algorithm [16].

In experiments, we compare the classification perfor-

mance of generative maximum likelihood (ML) and dis-

criminative MM and CL parameter learning approaches.

We show that maximizing the margin dominates the con-

ditional likelihood approach with respect to classification

performance for most cases. Furthermore, we provide

results for maximum margin optimization using convex

relaxation [1]. We achieve highly similar classification

rates, whereas our CG-based margin optimization is

computationally dramatically less costly. All Bayesian

network classifiers use either naive Bayes (NB) or generatively

and discriminatively optimized¹ tree augmented

naive Bayes (TAN) structures. We also provide results

for SVMs showing that margin-optimized Bayesian net-

work classifiers are serious competitors – especially in

cases where small-sized and probabilistic models are re-

quired. Moreover, we show experiments demonstrating

the ability of handling missing feature scenarios. We are

particularly interested in situations where unanticipated

missing feature values arise during classification, i.e.

during testing, which can be easily handled by our

discriminatively optimized Bayesian network classifiers.

Discriminative models usually require mechanisms to

first complete unknown feature values in the data –

known as data imputation – and then apply the standard

classification approach to the completed data. We

provide results for two imputation techniques, namely

(i) mean value imputation, i.e. the missing feature value

is replaced with the mean value of the feature over the

entire training data set; (ii) k-nearest neighbor (kNN)

value imputation, i.e. the mean value (for discretized

data the most frequent value) of the k-nearest neighbors

is used as surrogate of the missing feature value. kNN

feature value imputation is slow and requires the train-

ing data to be available during classification.
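The two imputation baselines described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names `mean_impute` and `knn_impute`, the use of `None` to mark a missing value, and the squared Euclidean distance over the observed features are all assumptions made for this example, and the training data are assumed to be complete.

```python
from collections import Counter

def mean_impute(train, sample):
    """Replace None entries of `sample` with the per-feature training mean."""
    n_feat = len(sample)
    means = [sum(row[j] for row in train) / len(train) for j in range(n_feat)]
    return [means[j] if v is None else v for j, v in enumerate(sample)]

def knn_impute(train, sample, k=3, discrete=False):
    """Replace None entries using the k nearest training rows.

    Distance is squared Euclidean, computed over the observed features only
    (an assumption of this sketch).
    """
    observed = [j for j, v in enumerate(sample) if v is not None]

    def dist(row):
        return sum((row[j] - sample[j]) ** 2 for j in observed)

    neighbors = sorted(train, key=dist)[:k]
    filled = list(sample)
    for j, v in enumerate(sample):
        if v is None:
            vals = [row[j] for row in neighbors]
            if discrete:
                # most frequent value among the neighbors
                filled[j] = Counter(vals).most_common(1)[0][0]
            else:
                # mean value among the neighbors
                filled[j] = sum(vals) / len(vals)
    return filled

# Toy data: two clusters; the second feature of x is missing.
train = [[1.0, 2.0], [1.2, 2.2], [5.0, 9.0], [5.2, 9.4]]
x = [1.1, None]
mean_filled = mean_impute(train, x)
knn_filled = knn_impute(train, x, k=2)
```

The sketch also makes the cost argument visible: `knn_impute` scans the whole training set for every test sample, whereas `mean_impute` only needs one stored mean per feature.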

The paper is organized as follows: In Section 2, we

introduce our notation and briefly review Bayesian net-

works, ML parameter learning as well as NB and TAN

structures. In Section 3, we introduce MM parameter

learning. Section 4 summarizes a generative and two

discriminative structure learning algorithms used in the

experiments. In Section 5, we present experimental results for phonetic classification using the TIMIT speech corpus [17], for handwritten digit recognition using the MNIST [18] and USPS data sets, and for a remote sensing application. Furthermore, experiments for missing feature situations are reported in Section 5.1 and 5.2. In Section 5.3, we show results for margin-based Bayesian network parameter optimization using convex relaxation and provide the runtime for each of the maximum margin parameter learning algorithms. Finally, Section 6 concludes the paper.

1. By “discriminative structure learning”, we mean that the aim of optimization is to learn the structure of the network by maximizing a cost function that is suitable for reducing classification errors, such as conditional likelihood or classification rate.

2 BAYESIAN NETWORK CLASSIFIERS

A Bayesian network [19] B = ⟨G,Θ⟩ is a directed acyclic

graph G = (Z,E) consisting of a set of nodes Z and a

set of directed edges E connecting the nodes. This graph

represents factorization properties of the distribution of a

set of random variables Z = {Z1,...,ZN+1}, where |Zj|

denotes the cardinality of Zj. The variables in Z have val-

ues denoted by lower case letters z = {z1,z2,...,zN+1}.

We use boldface capital letters, e.g. Z, to denote a

set of random variables and correspondingly boldface

lower case letters, e.g. z, denote a set of instantiations

(values). Without loss of generality, in Bayesian network

classifiers the random variable Z1 represents the class

variable C ∈ {1,...,|C|}, where |C| is the number of

classes and X1:N = {X1,...,XN} = {Z2,...,ZN+1}

denotes the set of random variables which model the

N attributes of the classifier. In a Bayesian network

each node is independent of its non-descendants given

its parents. Conditional independencies among variables

reduce the computational effort for exact inference on

such a graph. The set of parameters which quantify the network is represented by $\Theta$. Each node $Z_j$ is represented as a local conditional probability distribution given its parents $Z_{\Pi_j}$. We use $\theta^j_{i|h}$ to denote a specific conditional probability table entry (assuming discrete variables); the probability that variable $Z_j$ takes on its $i$th value assignment given that its parents $Z_{\Pi_j}$ take their $h$th (lexicographically ordered) assignment, i.e. $\theta^j_{i|h} = P_\Theta\left(Z_j = i \mid Z_{\Pi_j} = h\right)$. Hence, $h$ contains the parent configuration assuming that the first element of $h$ denoted as $h_1$ is the conditioning class and the remaining elements $h \setminus h_1$ are the conditioning parent attribute values. The training data consists of $M$ independent and identically distributed samples $S = \{z^m\}_{m=1}^{M} = \{(c^m, x^m_{1:N})\}_{m=1}^{M}$ where $M = |S|$. The joint probability distribution of a sample $z^m$ is determined as

$$P_\Theta(Z = z^m) = \prod_{j=1}^{N+1} P_\Theta\left(Z_j = z^m_j \mid Z_{\Pi_j} = z^m_{\Pi_j}\right) = \prod_{j=1}^{N+1} \prod_{i=1}^{|Z_j|} \prod_{h} \left(\theta^j_{i|h}\right)^{u^{j,m}_{i|h}}, \tag{1}$$

where we use $u^{j,m}_{i|h}$ to represent the $m$th sample in binary form, i.e. $u^{j,m}_{i|h} = \mathbb{1}\{z^m_j = i \text{ and } z^m_{\Pi_j} = h\}$. Symbol $\mathbb{1}\{i = j\}$ denotes the indicator function, i.e. it equals 1 if the Boolean expression $i = j$ is true and 0 otherwise. The class labels are predicted using the maximum a-posteriori (MAP) estimate obtained by Bayes rule, i.e.

$$P_\Theta(C = c \mid X_{1:N} = x^m_{1:N}) = \frac{P_\Theta(C = c, X_{1:N} = x^m_{1:N})}{\sum_{c'=1}^{|C|} P_\Theta(C = c', X_{1:N} = x^m_{1:N})},$$

where the most likely class $c^*$ is determined as $c^* = \arg\max_{c' \in \{1,\ldots,|C|\}} P_\Theta(C = c' \mid X_{1:N} = x^m_{1:N})$.

For the sake of brevity, we only notate instantiations of the random variables in the sequel.
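As an illustration of the MAP rule, the following sketch evaluates the joint probability and the class posterior for the special case of a naive Bayes structure, where every attribute has only the class as parent. The data layout (`prior[c]`, `cpts[j][c][v]`), the function names, and the toy numbers are assumptions of this example. A feature value of `None` is skipped, which for a naive Bayes structure is equivalent to summing that variable out, since each conditional probability table row sums to one.

```python
import math

def nb_log_joint(prior, cpts, x):
    """Return log P(C=c, X=x) for every class c under a naive Bayes structure.

    prior[c]      = P(C = c)
    cpts[j][c][v] = P(X_j = v | C = c)
    x[j]          = observed value of attribute j, or None if missing
    """
    scores = []
    for c in range(len(prior)):
        s = math.log(prior[c])
        for j, v in enumerate(x):
            if v is not None:  # a missing X_j sums out to 1 under NB
                s += math.log(cpts[j][c][v])
        scores.append(s)
    return scores

def posterior(log_joint):
    """Normalize joint log-probabilities into the class posterior (Bayes rule)."""
    top = max(log_joint)  # subtract the maximum for numerical stability
    unnorm = [math.exp(l - top) for l in log_joint]
    z = sum(unnorm)
    return [u / z for u in unnorm]

# Toy model with two classes and two binary attributes (hypothetical numbers).
prior = [0.6, 0.4]
cpts = [
    [[0.9, 0.1], [0.2, 0.8]],  # P(X_1 | C)
    [[0.7, 0.3], [0.4, 0.6]],  # P(X_2 | C)
]
post = posterior(nb_log_joint(prior, cpts, [1, 1]))
c_star = max(range(len(post)), key=post.__getitem__)  # MAP class

# Same query with the second attribute missing.
post_missing = posterior(nb_log_joint(prior, cpts, [1, None]))
```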

2.1 Generative ML Parameter Learning

The log likelihood function of a fixed structure of B is

$$LL(B|S) = \sum_{m=1}^{M} \sum_{j=1}^{N+1} \sum_{i=1}^{|Z_j|} \sum_{h} u^{j,m}_{i|h} \log\left(\theta^j_{i|h}\right).$$

Maximizing $LL(B|S)$ leads to the ML estimate of the parameters

$$\theta^j_{i|h} = \frac{\sum_{m=1}^{M} u^{j,m}_{i|h}}{\sum_{m=1}^{M} \sum_{l=1}^{|Z_j|} u^{j,m}_{l|h}},$$

using Lagrange multipliers to constrain the parameters to a valid normalized probability distribution, i.e. $\sum_{i=1}^{|Z_j|} \theta^j_{i|h} = 1$.

2.2 Discriminative CL Parameter Learning

Maximizing CL is tightly connected to minimizing the

empirical risk. Unfortunately, CL does not decompose as

ML does. Consequently, there is no closed-form solution.

The conditional log likelihood (CLL) is

$$CLL(B|S) = \log \prod_{m=1}^{M} P_\Theta(c^m \mid x^m_{1:N}) \tag{2}$$

$$= \sum_{m=1}^{M} \left[ \log P_\Theta(c^m, x^m_{1:N}) - \log \sum_{c=1}^{|C|} P_\Theta(c, x^m_{1:N}) \right].$$

A conjugate gradient algorithm [10], [20] and the EBW

method [12] have been proposed for maximizing

CLL(B|S). For the sake of completeness, we briefly

sketch the CG algorithm for CL optimization in the

Appendix.
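To make the two criteria concrete, the sketch below computes the relative-frequency ML estimates and evaluates the CLL for a naive Bayes structure. It is an illustration under stated assumptions, not the authors' code: the function names, the data layout, and the floor value `eps` used to replace zero probabilities are hypothetical choices for this example.

```python
import math

def _normalize(counts, eps):
    """Turn counts into probabilities, flooring zeros at eps and renormalizing."""
    total = sum(counts)
    if total == 0:
        return [1.0 / len(counts)] * len(counts)
    probs = [max(n / total, eps) for n in counts]
    z = sum(probs)
    return [p / z for p in probs]

def ml_estimate_nb(samples, n_classes, n_values, eps=1e-6):
    """Relative-frequency ML estimates for a naive Bayes classifier.

    samples:     list of (c, x) with class c and attribute list x
    n_values[j]: number of values attribute j can take
    """
    n_attr = len(n_values)
    class_counts = [0] * n_classes
    counts = [[[0] * n_values[j] for _ in range(n_classes)] for j in range(n_attr)]
    for c, x in samples:
        class_counts[c] += 1
        for j, v in enumerate(x):
            counts[j][c][v] += 1
    prior = _normalize(class_counts, eps)
    cpts = [[_normalize(counts[j][c], eps) for c in range(n_classes)]
            for j in range(n_attr)]
    return prior, cpts

def cll(prior, cpts, samples):
    """Conditional log likelihood: sum_m [log P(c^m, x^m) - log sum_c P(c, x^m)]."""
    total = 0.0
    for c_true, x in samples:
        lj = [math.log(prior[c]) + sum(math.log(cpts[j][c][v]) for j, v in enumerate(x))
              for c in range(len(prior))]
        top = max(lj)
        log_evidence = top + math.log(sum(math.exp(l - top) for l in lj))
        total += lj[c_true] - log_evidence
    return total

# Toy data: the first attribute determines the class, the second is noise.
samples = [(0, [0, 0]), (0, [0, 1]), (1, [1, 1]), (1, [1, 0])]
prior, cpts = ml_estimate_nb(samples, n_classes=2, n_values=[2, 2])
score = cll(prior, cpts, samples)
```

The ML estimates are obtained in one counting pass, whereas the log-sum over classes in `cll` couples all parameters, which is why CL optimization needs an iterative method.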

2.3 Structures

In this work, we restrict our experiments to NB and

TAN structures defined in the next paragraphs. The NB

network assumes that all the attributes are condition-

ally independent given the class label. This means that,

given C, any subset of X is independent of any other

disjoint subset of X. As reported in the literature [21],

[22], the performance of the NB classifier is surprisingly

good even if the conditional independence assumption

between attributes is unrealistic or even false in most

of the data. Reasons for the utility of the NB classifier

range from benefits from the bias/variance tradeoff

perspective [21] to structures that are inherently poor

from a generative perspective but good from a discrimi-

native perspective [23]. The structure of the naive Bayes

classifier represented as a Bayesian network is illustrated

in Figure 1(a).

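[Figure 1: two network diagrams over the class node C and the attribute nodes X1, X2, X3, ..., XN; panel (a) shows the NB structure and panel (b) the TAN structure.]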

Fig. 1. Bayesian Network: (a) NB, (b) TAN.

In order to overcome some of the limitations of the

NB classifier, Friedman et al. [21] introduced the TAN

classifier. A TAN is based on structural augmentations

of the NB network: Additional edges are added between

attributes. Each attribute may have at most one other

attribute as an additional parent which means that the

tree-width of the attribute induced sub-graph is unity²,

i.e. we have to learn a 1-tree over the attributes. The

maximum number of edges added to relax the indepen-

dence assumption between the attributes is N −1. Thus,

two attributes might not be conditionally independent

given the class label in a TAN. An example of a TAN

network is shown in Figure 1(b).

A TAN network is typically initialized as a NB net-

work and additional edges between attributes are de-

termined through structure learning. Hence, TAN struc-

tures are restricted such that the class node remains

parent-less, i.e. CΠ = ∅. An extension of the TAN

network is to use a k-tree, i.e. each attribute can have

a maximum of k attribute nodes as parents. In [20], we

noticed that 2-trees over the features do not improve

classification performance significantly without regular-

ization. Therefore, we limit the experiments to NB and

TAN structures. Many other network topologies have

been suggested in the past – a good overview is provided

in [26].

2. The tree-width of a graph is defined as the size (i.e. number

of variables) of the largest clique of the moralized and triangulated

directed graph minus one. Since there are commonly multiple trian-

gulated graphs, the tree-width is defined by the triangulation where

the largest clique has the fewest number of variables. More details are

given in [24], [25] and references therein.

3 DISCRIMINATIVE MARGIN-BASED PARAMETER LEARNING

The proposed CG-based maximum margin learning al-

gorithm is developed in the following sections.

3.1 Maximum Margin Objective Function

The multi-class margin [1] of sample $m$ can be expressed as

$$\tilde{d}^m_\Theta = \min_{c \neq c^m} \frac{P_\Theta(c^m \mid x^m_{1:N})}{P_\Theta(c \mid x^m_{1:N})} = \frac{P_\Theta(c^m, x^m_{1:N})}{\max_{c \neq c^m} P_\Theta(c, x^m_{1:N})}. \tag{3}$$

Sample $m$ is correctly classified if and only if $\tilde{d}^m_\Theta > 1$. We replace the maximum operator by the differentiable softmax function $\max_x f(x) \approx \log\left[\sum_x \exp(\eta f(x))\right]^{\frac{1}{\eta}}$ parameterized by $\eta$, where $\eta \geq 1$ and $f(x)$ is non-negative [6]. In the limit of $\eta \to \infty$ the approximation approaches the maximum operator.³ Using this we can define the approximate multi-class margin $d^m_\Theta$. Taking the logarithm we obtain

$$\log d^m_\Theta = \log P_\Theta(c^m, x^m_{1:N}) - \frac{1}{\eta} \log \sum_{c \neq c^m} \left(P_\Theta(c, x^m_{1:N})\right)^{\eta}. \tag{4}$$
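The effect of the softmax approximation in Eq. (4) can be checked numerically. The sketch below assumes that the joint log-probabilities of all classes are already available in a list `log_joint`; the function name and the toy values are hypothetical. Because the softmax term upper-bounds the maximum, the approximate log-margin never exceeds the exact one and approaches it as η grows.

```python
import math

def log_margin(log_joint, c_true, eta=10.0):
    """Approximate log-margin of Eq. (4).

    log d = log P(c_true, x) - (1/eta) * log sum_{c != c_true} P(c, x)^eta,
    evaluated in the log domain for numerical stability.
    """
    others = [eta * l for c, l in enumerate(log_joint) if c != c_true]
    top = max(others)
    log_sum = top + math.log(sum(math.exp(o - top) for o in others))
    return log_joint[c_true] - log_sum / eta

# Toy joint probabilities for three classes; class 0 is the true class.
log_joint = [math.log(0.05), math.log(0.03), math.log(0.01)]
exact = log_joint[0] - max(log_joint[1:])        # log of Eq. (3)
approx_eta10 = log_margin(log_joint, 0, eta=10.0)
approx_eta1 = log_margin(log_joint, 0, eta=1.0)  # classical softmax case
```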

Usually, the maximum margin approach maximizes the margin of the sample with the smallest margin for a separable classification problem [27], i.e. the objective is to maximize $\min_{m=1,\ldots,M} \log d^m_\Theta$. For a non-separable problem, we aim to relax this by introducing a soft margin, i.e. we focus on samples with $\log d^m_\Theta$ close to zero. For this purpose, we consider the hinge loss function

$$\widetilde{M}(B|S) = \sum_{m=1}^{M} \min\left(1, \lambda \log d^m_\Theta\right),$$

where the scaling parameter $\lambda > 0$ controls the margin with respect to the loss function and is set by cross-validation. Maximizing this function with respect to the parameters $\Theta$ implicitly increases the log-margin, whereas the emphasis is on samples with $\lambda \log d^m_\Theta < 1$, i.e. samples with a large positive margin have no impact on the optimization. Maximizing $\widetilde{M}(B|S)$ using CG is not straightforward because the hinge function is not differentiable at $\lambda \log d^m_\Theta = 1$. Therefore, we propose to use a smooth hinge function $h_\kappa(y)$ inspired by the Huber loss [28] which is differentiable in $\mathbb{R}$ and has a similar shape as $\min[1, y]$:

$$h_\kappa(y) = \begin{cases} y + \kappa, & \text{if } y \leq 1 - 2\kappa, \\ 1 - \frac{(y-1)^2}{4\kappa}, & \text{if } 1 - 2\kappa < y < 1, \text{ and} \\ 1, & \text{if } y \geq 1, \end{cases} \tag{5}$$

where $\kappa$ parameterizes this loss function. For $\kappa \to 0$ the smooth hinge function approaches $\min(1, y)$. This function requires dividing the data $S$ into three partitions depending on $y^m = \lambda \log d^m_\Theta$, i.e. $S^1_\Theta$ contains samples where $y^m \leq 1 - 2\kappa$, $S^2_\Theta$ consists of samples with a margin in the range $1 - 2\kappa < y^m < 1$, and $S^3_\Theta = S \setminus \left(S^1_\Theta \cup S^2_\Theta\right)$. The smooth hinge function $h_\kappa(y)$ parameterized by $\kappa$ is shown in Figure 2.

3. Empirical results showed that the performance of the algorithm is not sensitive to the choice of η for η ≥ 5. The case η = 1 resembles the classical softmax function which empirically showed a slightly inferior performance.

[Figure 2: plot of hκ(y) over y from −1 to 2, comparing the hinge function with the smooth hinge for κ = 0.5 and κ = 0.1; the quadratic and linear regions of the smooth hinge are marked.]

Fig. 2. Differentiable approximation of the hinge loss for

κ = 0.5 and κ = 0.1.
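The smooth hinge function is short enough to state directly in code. The following sketch, with hypothetical function names, implements the three pieces together with the derivative, which makes it easy to verify that value and slope agree where the pieces meet.

```python
def smooth_hinge(y, kappa=0.1):
    """Smooth hinge: linear, then quadratic, then constant."""
    if y <= 1.0 - 2.0 * kappa:
        return y + kappa
    if y < 1.0:
        return 1.0 - (y - 1.0) ** 2 / (4.0 * kappa)
    return 1.0

def smooth_hinge_grad(y, kappa=0.1):
    """Derivative of the smooth hinge with respect to y."""
    if y <= 1.0 - 2.0 * kappa:
        return 1.0
    if y < 1.0:
        return -(y - 1.0) / (2.0 * kappa)
    return 0.0

# The slope falls continuously from 1 (linear part) to 0 (saturated part).
slopes = [smooth_hinge_grad(y, kappa=0.1) for y in (0.0, 0.85, 0.9, 0.95, 1.5)]
```

At y = 1 − 2κ both neighbouring pieces evaluate to 1 − κ with slope 1, and at y = 1 both evaluate to 1 with slope 0, so the function is continuously differentiable.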

Similar to [29], we empirically identified typical

values of κ in the range between 0.01 and 0.5. Tuning

parameter κ in the given range has a moderate impact on

the performance (as we show in experiments). Hence, we

suggest fixing this parameter in case of time constraints.

Finally, using the introduced smooth hinge loss our

objective function for margin maximization is

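$$M(B|S) = \sum_{m \in S^1_\Theta} \left(\lambda \log d^m_\Theta + \kappa\right) + \sum_{m \in S^2_\Theta} \left[1 - \frac{\left(\lambda \log d^m_\Theta - 1\right)^2}{4\kappa}\right] + \left|S^3_\Theta\right|. \tag{6}$$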

This function is differentiable and can be optimized by

CG methods.
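A minimal sketch of how the objective can be evaluated from per-sample log-margins is given below. The function name `mm_objective` and the toy values of λ and κ are assumptions for illustration. Alongside the objective value, the function returns the derivative of each sample's term with respect to its log-margin, since a gradient-based optimizer needs exactly this quantity.

```python
def mm_objective(log_margins, lam=0.1, kappa=0.1):
    """Smooth-hinge margin objective and per-sample derivative weights.

    log_margins: list of log d^m values, one per training sample.
    Returns (objective value, list of d(term_m)/d(log d^m)).
    """
    total = 0.0
    weights = []
    for ld in log_margins:
        y = lam * ld
        if y <= 1.0 - 2.0 * kappa:    # partition S^1: linear region
            total += y + kappa
            weights.append(lam)
        elif y < 1.0:                 # partition S^2: quadratic region
            total += 1.0 - (y - 1.0) ** 2 / (4.0 * kappa)
            weights.append(-lam * (y - 1.0) / (2.0 * kappa))
        else:                         # partition S^3: saturated, no influence
            total += 1.0
            weights.append(0.0)
    return total, weights

# One sample per partition: misclassified, near the margin, and far beyond it.
value, weights = mm_objective([-2.0, 9.0, 30.0], lam=0.1, kappa=0.1)
```

The third sample contributes a constant and a zero weight, which reflects the statement that samples with a large positive margin do not affect the optimization.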

3.2 CG Algorithm

We use a conjugate gradient algorithm with line-search [30] which requires both the objective function (6) and its derivative. In particular, the Polak-Ribiere method is used [9]. The probability $\theta^j_{i|h}$ is constrained to $\theta^j_{i|h} \geq 0$ and $\sum_{i=1}^{|Z_j|} \theta^j_{i|h} = 1$. To incorporate these constraints in the conjugate gradient algorithm we re-parameterize the problem according to

$$\theta^j_{i|h} = \frac{\exp\left(\beta^j_{i|h}\right)}{\sum_{l=1}^{|Z_j|} \exp\left(\beta^j_{l|h}\right)},$$

where $\beta^j_{i|h} \in \mathbb{R}$ is unconstrained. The CG algorithm requires the gradient $\frac{\partial M(B|S)}{\partial \beta^j_{i|h}}$ which is obtained using the chain rule as

$$\frac{\partial M(B|S)}{\partial \beta^j_{i|h}} = \sum_{k=1}^{|Z_j|} \frac{\partial M(B|S)}{\partial \theta^j_{k|h}} \frac{\partial \theta^j_{k|h}}{\partial \beta^j_{i|h}}. \tag{7}$$
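The chain rule of Eq. (7) can be illustrated for a single conditional probability table row. The sketch below uses the standard softmax derivative, ∂θ_k/∂β_i = θ_k(1{k=i} − θ_i), and checks the result against finite differences. The function names and the toy objective (a weighted log-likelihood with hypothetical counts) are assumptions chosen only to keep the example self-contained; the toy objective is not the margin objective of Eq. (6).

```python
import math

def softmax(beta):
    """Map unconstrained beta to a normalized probability vector theta."""
    top = max(beta)
    e = [math.exp(b - top) for b in beta]
    z = sum(e)
    return [v / z for v in e]

def grad_wrt_beta(grad_theta, theta):
    """Chain rule through the softmax for one CPT row.

    sum_k g_k * theta_k * (1{k=i} - theta_i) = theta_i * (g_i - sum_k g_k theta_k)
    """
    inner = sum(g * t for g, t in zip(grad_theta, theta))
    return [t * (g - inner) for g, t in zip(grad_theta, theta)]

# Toy objective f(theta) = sum_k n_k log(theta_k) with hypothetical counts n_k.
counts = [3.0, 1.0, 2.0]

def f(beta):
    return sum(n * math.log(t) for n, t in zip(counts, softmax(beta)))

beta = [0.2, -0.5, 1.0]
theta = softmax(beta)
analytic = grad_wrt_beta([n / t for n, t in zip(counts, theta)], theta)

# Central finite differences as an independent check of the analytic gradient.
h = 1e-6
numeric = []
for i in range(len(beta)):
    up = beta[:i] + [beta[i] + h] + beta[i + 1:]
    dn = beta[:i] + [beta[i] - h] + beta[i + 1:]
    numeric.append((f(up) - f(dn)) / (2 * h))
```

Because every β yields a valid distribution, an unconstrained optimizer can be applied without any projection step.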

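3.3 Derivatives

The derivative $\frac{\partial M(B|S)}{\partial \Theta}$ in Eq. (7) is

$$\frac{\partial M(B|S)}{\partial \theta^j_{i|h}} = \sum_{m=1}^{M} s^{m,\lambda}_\Theta \frac{\partial \log d^m_\Theta}{\partial \theta^j_{i|h}},$$

where $s^{m,\lambda}_\Theta$ denotes a sample dependent weight given by

$$s^{m,\lambda}_\Theta = \begin{cases} \lambda, & \text{if } m \in S^1_\Theta, \\ -\frac{\lambda}{2\kappa}\left(\lambda \log d^m_\Theta - 1\right), & \text{if } m \in S^2_\Theta, \text{ and} \\ 0, & \text{if } m \in S^3_\Theta. \end{cases} \tag{8}$$

When determining the derivative of $\log d^m_\Theta$ we have to distinguish among two cases: For TAN and NB structures each parameter $\theta^j_{i|h}$ involves the class node value, either $C = i$ for $j = 1$ or $C = h_1$ for $j > 1$ where $h_1$ denotes the class instantiation $h_1 \in h$. Due to this fact, at most one summand is nonzero when differentiating the term $\sum_{c \neq c^m} \left(P_\Theta(c, x^m_{1:N})\right)^{\eta}$ in Eq. (4) with respect to $\theta^j_{i|h}$.

Case A: For the class variable, i.e. $j = 1$ and $h = \emptyset$, the derivative of Eq. (4) after introducing the joint probability of Eq. (1) results in

$$\frac{\partial \log d^m_\Theta}{\partial \theta^1_i} = \frac{u^{1,m}_i}{\theta^1_i} - \mathbb{1}\{i \neq c^m\} \frac{V^m_i}{\theta^1_i},$$

where we set $V^m_i$ to

$$V^m_i = \frac{\left[P_\Theta(i, x^m_{1:N})\right]^{\eta}}{\sum_{c=1,\, c \neq c^m}^{|C|} \left[P_\Theta(c, x^m_{1:N})\right]^{\eta}}.$$

Case B: For the attribute variables, i.e. $j > 1$, we differentiate correspondingly and have

$$\frac{\partial \log d^m_\Theta}{\partial \theta^j_{i|h}} = \frac{u^{j,m}_{i|h}}{\theta^j_{i|h}} - \mathbb{1}\{h_1 \neq c^m\} V^m_{h_1} \frac{v^{j,m}_{i|h \setminus h_1}}{\theta^j_{i|h}},$$

where $v^{j,m}_{i|h \setminus h_1} = \mathbb{1}\{z^m_j = i \text{ and } z^m_{\Pi_j} = h \setminus h_1\}$. Hence, the gradient $\frac{\partial M(B|S)}{\partial \Theta}$ for Case A and Case B is

$$\frac{\partial M(B|S)}{\partial \theta^1_i} = \sum_{m=1}^{M} \frac{s^{m,\lambda}_\Theta}{\theta^1_i} \left[u^{1,m}_i - \mathbb{1}\{i \neq c^m\} V^m_i\right]$$

and

$$\frac{\partial M(B|S)}{\partial \theta^j_{i|h}} = \sum_{m=1}^{M} \frac{s^{m,\lambda}_\Theta}{\theta^j_{i|h}} \left[u^{j,m}_{i|h} - \mathbb{1}\{h_1 \neq c^m\} V^m_{h_1} v^{j,m}_{i|h \setminus h_1}\right],$$

respectively. These derivatives are further used in Eq. (7) resulting in the required gradient for the CG algorithm. Hence, for Case A we obtain

$$\frac{\partial M(B|S)}{\partial \beta^1_i} = \sum_{m=1}^{M} s^{m,\lambda}_\Theta \left[u^{1,m}_i - \mathbb{1}\{i \neq c^m\} V^m_i\right] - \theta^1_i \sum_{m=1}^{M} \sum_{c=1}^{|C|} s^{m,\lambda}_\Theta \left[u^{1,m}_c - \mathbb{1}\{c \neq c^m\} V^m_c\right],$$

and for Case B we have

$$\frac{\partial M(B|S)}{\partial \beta^j_{i|h}} = \sum_{m=1}^{M} s^{m,\lambda}_\Theta \left[u^{j,m}_{i|h} - \mathbb{1}\{h_1 \neq c^m\} V^m_{h_1} v^{j,m}_{i|h \setminus h_1}\right] - \theta^j_{i|h} \sum_{m=1}^{M} \sum_{l=1}^{|Z_j|} s^{m,\lambda}_\Theta \left[u^{j,m}_{l|h} - \mathbb{1}\{h_1 \neq c^m\} V^m_{h_1} v^{j,m}_{l|h \setminus h_1}\right].$$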

4 STRUCTURE LEARNING

This section provides three structure learning heuristics

– one generative and two discriminative ones – used in

the experiments in Section 5. Note that the parameters

during structure learning are optimized generatively

using maximum likelihood estimation [19].

4.1 Generative Structure Learning

The conditional mutual information (CMI) [31] between

the attributes given the class variable is computed as:

$$I(X_i; X_j \mid C) = E_{P(X_i, X_j, C)}\left[\log \frac{P(X_i, X_j \mid C)}{P(X_i \mid C)\, P(X_j \mid C)}\right],$$

where $E_{P(X)}[f(X)]$ denotes the expectation of $f(X)$ with respect to $P(X)$. It measures the information between $X_i$ and $X_j$ in the context of $C$. In [21], an algorithm for constructing TAN networks using this measure

is provided. We briefly review this algorithm in the

following:

1) Compute the pairwise CMI I (Xi;Xj|C) for all 1 ≤

i ≤ N and i < j ≤ N.

2) Build an undirected 1-tree using the maximal

weighted spanning tree algorithm [19] where

each edge connecting Xi and Xj is weighted by

I (Xi;Xj|C).

3) Transform the undirected 1-tree into a directed tree.

That is, select a root variable and direct all edges

away from this root. Add to this tree the class

node C and the edges from C to all attributes

X1,...,XN.

This generative structure learning method is abbreviated

as CMI in the experiments.
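The three steps above can be sketched compactly. In the following illustration the CMI is estimated from empirical counts, and the maximum weighted spanning tree is grown with Prim's algorithm starting from a chosen root, which directs every edge away from the root as a side effect. Prim's algorithm, the function names, and the toy data are choices made for this sketch; the source only specifies that a maximal weighted spanning tree algorithm [19] is used. The edges from C to all attributes are implicit and not returned.

```python
import math
from collections import Counter

def cmi(samples, i, j):
    """Empirical conditional mutual information I(X_i; X_j | C) in nats.

    samples: list of (c, x) with class c and attribute list x.
    """
    m = len(samples)
    n_c = Counter(c for c, _ in samples)
    n_ic = Counter((x[i], c) for c, x in samples)
    n_jc = Counter((x[j], c) for c, x in samples)
    n_ijc = Counter((x[i], x[j], c) for c, x in samples)
    total = 0.0
    for (a, b, c), n in n_ijc.items():
        # P(a,b|c) / (P(a|c) P(b|c)) = n_abc * n_c / (n_ac * n_bc)
        total += n / m * math.log(n * n_c[c] / (n_ic[(a, c)] * n_jc[(b, c)]))
    return total

def tan_cmi_edges(samples, n_attr, root=0):
    """Directed attribute edges (parent, child) of a maximum weighted spanning tree."""
    w = {(i, j): cmi(samples, i, j)
         for i in range(n_attr) for j in range(i + 1, n_attr)}
    in_tree = {root}
    edges = []
    while len(in_tree) < n_attr:
        # Prim step: heaviest edge leaving the current tree.
        parent, child = max(
            ((a, b) for a in in_tree for b in range(n_attr) if b not in in_tree),
            key=lambda e: w[(min(e), max(e))],
        )
        edges.append((parent, child))
        in_tree.add(child)
    return edges

# Toy data: X_0 and X_1 are copies of each other, X_2 is independent of both.
samples = [(c, [a, a, b]) for c in (0, 1) for a in (0, 1) for b in (0, 1)]
edges = tan_cmi_edges(samples, n_attr=3, root=0)
```

On the toy data the only informative pair is (X_0, X_1), so that edge is selected first; the remaining attribute is attached by an edge of weight zero.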

4.2 Greedy Discriminative Structure Learning

This method proceeds as follows: a network is initialized

to NB and at each iteration an edge is added that

gives the largest improvement of the scoring function,

while maintaining a partial 1-tree. Basically, two scoring

functions have been considered: the classification rate

(CR) [32], [33]

$$CR(B|S_V) = \frac{1}{M_V} \sum_{m=1}^{M_V} \mathbb{1}\left\{c^m = \arg\max_{c'} P_\Theta(c' \mid x^m_{1:N})\right\}$$

and the CL [34]

$$CL(B|S_V) = \prod_{m=1}^{M_V} P_\Theta(c^m \mid x^m_{1:N}),$$

where $S_V = \{(c^m, x^m_{1:N})\}_{m=1}^{M_V}$ is the validation data and $M_V = |S_V|$.

The process of adding edges is terminated when there is no edge which further improves the score. Thus, it might result in a partial 1-tree (forest) over the attributes. This approach is computationally expensive since each time an edge is added, the scores for all $O(N^2)$ edges need to be re-evaluated due to the discriminative non-decomposable scoring functions we employ. Overall, for learning a k-tree structure, $O(N^{k+2})$ score evaluations are necessary. In our experiments, we consider the CR score which is directly related to the empirical risk in [2]. The CR is the discriminative criterion that, given sufficient training data, most directly evaluates the objective (small error rate), while an alternative would be to use a convex upper-bound on the 0/1-loss function [35]. Since we are optimizing over a constrained model space (k-trees) regularization is implicit. The CR evaluation can be accelerated by techniques presented in [20], [36]. In the experiments this greedy heuristic is labeled as TAN-CR for 1-tree structures.

Recently, the maximum margin score was introduced for discriminatively optimizing the structure of Bayesian network classifiers [36]. As a search heuristic simulated annealing was used, which offers mechanisms to escape from locally optimal solutions. The maximum margin optimized Bayesian network structures achieve good classification performance.

4.3 Order-based Discriminative Structure Learning

In [37], [20], an order-based greedy algorithm (OMI-CR) has been introduced which is able to find a discriminative TAN structure with only $O(N^2)$ score evaluations. The order-based algorithm consists of 2 steps:

1) Establishing an ordering: First, a total ordering $\prec$ of the variables $X_{1:N}$ according to the CMI is established. The feature that is most informative about $C$ is selected first. The next node in the order is the node that is most informative about $C$ conditioned on the first node. More specifically, this step determines an ordered sequence of nodes $X^{1:N}_\prec = \{X^1_\prec, X^2_\prec, \ldots, X^N_\prec\}$ according to

$$X^j_\prec \leftarrow \arg\max_{X \in X_{1:N} \setminus X^{1:j-1}_\prec} \left[ I\left(C; X \mid X^{1:j-1}_\prec\right) \right],$$

where $j \in \{1, \ldots, N\}$.

2) Selecting parents with respect to a given order to form a k-tree: Once the variables are ordered $X^{1:N}_\prec$, the parent $X_{\Pi_j} \in \mathbf{X}_{\Pi_j} = X^{1:j-1}_\prec$ for each $X^j_\prec$ ($j \in \{3, \ldots, N\}$) is selected. In case of a small size of $\mathbf{X}_{\Pi_j}$ (i.e. $N$) and of $k$, a computationally costly scoring function to find $X_{\Pi_j}$ can be used. Basically, either the CL or the CR can be used as cost function to select the parents for learning a discriminative structure. We restrict our experiments to CR for parent selection (empirical results showed a better performance). The parameters are trained using ML learning. A parent is connected to $X^j_\prec$ only when CR is improved. Otherwise $X^j_\prec$ is left parentless (except $C$). This might result in a partial 1-tree (forest) over the attributes.

The classification results of the order-based greedy algorithm are not statistically significantly different from those of the greedy algorithm. Similarly, the SuperParent algorithm [32] is almost as efficient as OMI-CR, achieving slightly lower classification performance [20].

5 EXPERIMENTS

We present results for frame-based phonetic classifica-

tion using the TIMIT speech corpus [17], for handwritten

digit recognition using the MNIST [18] and the USPS

data, and for a remote sensing application. In the fol-

lowing, we list the used structure learning algorithms

for TAN networks:

• TAN-CMI: Generative TAN structure learning using

conditional mutual information (CMI).

• TAN-CR: Discriminative TAN structure learning us-

ing the naive greedy heuristic.

• TAN-OMI-CR: Discriminative TAN structure learn-

ing using the efficient order-based heuristic.

Once the structure has been determined, discriminative

parameter learning is performed. This is either

done using the proposed CG algorithm to maximize the

margin, labeled as CG-MM (see Section 3), or the CL

method (see Section 2.2). Additionally, we show results

for margin-based Bayesian network optimization using

convex relaxation, denoted as CVX-MM, and provide the

computational costs for both algorithms.

The parameters are initialized to the ML estimates for all discriminative parameter learning methods.⁴ Similar to [10], we use cross-tuning to estimate the optimal number of iterations for the CG algorithm to avoid overfitting. Additionally, the value of λ ∈ [0.001,...,0.5] and κ ∈ [0.01,...,0.5] resulting in the best classification rate is obtained empirically using cross-tuning. We note that instead of early stopping also regularization of the parameters can be used to avoid over-training of the models. In [5], concave priors have been suggested; however, ℓ1- or ℓ2-regularization in the unconstrained space of $\beta^j_{i|h}$ is an alternative. In any case, a weight measuring the trade-off between objective function and regularization term has to be determined by cross-validation. Hence, regularization offers no saving in tuning effort. Empirically, we could not observe any advantage of regularization over early stopping in terms of achieved classification performance.

4. Empirical results showed that the initialization of the Bayesian network to the ML estimates for MM or CL optimization performs better than pure random initialization.

Continuous features were discretized using recursive

minimal entropy partitioning [38] where the quantiza-

tion intervals were determined using only the training

data. Zero probabilities in the conditional probability

tables are replaced with small values ε. Further, we

used the same data set partitioning for various learning

algorithms.

5.1 Handwritten Digit Recognition and Phonetic Classification

5.1.1 Data Characteristics

In the following, we provide details about the used data

sets:

TIMIT-4/6 Data: This data set is extracted from the

TIMIT speech corpus using the dialect speaking region

4 which consists of 320 utterances from 16 male and 16

female speakers. Speech frames are classified into either

four or six classes using 110134 and 121629 samples,

respectively. Each sample is represented by 20 mel-

frequency cepstral coefficients (MFCCs) and wavelet-

based features. We perform classification experiments on

data of male speakers (Ma), female speakers (Fe), and

both genders (Ma+Fe), all in all resulting in 6 distinct

data sets (i.e. Ma, Fe, Ma+Fe × 4 and 6 classes). The

data have been split into 2 mutually exclusive subsets

where 70% is used for training and 30% for testing. More

details about the features can be found in [39].

MNIST Data: We present results for the handwritten

digit MNIST data [18] which contains 60000 samples for

training and 10000 digits for testing. We down-sample

the gray-level images by a factor of two which results in

a resolution of 14 × 14 pixels, i.e. 196 features.

USPS Data: This data set contains 11000 uniformly

distributed handwritten digit images from zip codes of

mail envelopes. The data set is split into 8000 images for

training and 3000 for testing. Each digit is represented

as a 16 × 16 grayscale image, where again each pixel is

considered as feature.

5.1.2 Results

Tables 1, 2, and 3 show the classification rates for MNIST, USPS, and the six TIMIT-4/6 data sets for various learning methods.⁵ Additionally, we provide classification performances for SVMs using a radial basis function (RBF) kernel.⁶ In particular, for TIMIT-4/6 we only show results for the NB structure. The reason is that the final step of MFCC feature extraction involves a discrete cosine transform, i.e. the features are decorrelated. Hence, we empirically observed that the independence assumptions of the NB structure are a good choice for these data sets.

TABLE 1
Classification results in [%] for MNIST data with standard deviation. Best parameter learning results for each structure are emphasized using bold font.

                          Parameter Learning
Classifier                ML           CG-CL        CG-MM
NB                        83.73±0.37   91.70±0.28   91.82±0.27
TAN-CMI                   91.28±0.28   93.80±0.24   94.70±0.22
TAN-OMI-CR                92.01±0.27   93.39±0.25   94.94±0.22
TAN-CR                    92.58±0.26   93.94±0.24   95.12±0.22
SVM (C∗ = 1, σ = 0.01)    96.40±0.19

TABLE 2

Classification results in [%] for USPS data with standard deviation. Best parameter learning results for each structure are emphasized using bold font.

| Classifier | ML | CG-CL | CG-MM |
|---|---|---|---|
| NB | 87.10±0.61 | 93.67±0.44 | **95.23±0.39** |
| TAN-CMI | 91.90±0.50 | 94.87±0.40 | **95.23±0.39** |
| TAN-OMI-CR | 92.40±0.48 | 94.90±0.40 | **95.70±0.37** |
| TAN-CR | 92.57±0.48 | 95.83±0.36 | **96.30±0.34** |

SVM (C∗ = 1, σ = 0.005): 97.86±0.26

TABLE 3

Classification results in [%] for TIMIT-4/6 data with standard deviation. Best results for each data set are emphasized using bold font.

| Data | NB, ML | NB, CG-CL | NB, CG-MM | SVM (C∗ = 1, σ = 0.05) |
|---|---|---|---|---|
| Ma+Fe-4 | 87.90±0.15 | 92.12±0.16 | 92.09±0.15 | **92.49±0.14** |
| Ma-4 | 88.69±0.25 | 92.81±0.20 | 92.97±0.20 | **93.30±0.20** |
| Fe-4 | 87.67±0.25 | 91.57±0.22 | 91.57±0.21 | **92.14±0.21** |
| Ma+Fe-6 | 81.82±0.20 | 85.41±0.18 | 85.43±0.18 | **86.24±0.18** |
| Ma-6 | 82.26±0.28 | 86.28±0.26 | 86.20±0.26 | **87.19±0.25** |
| Fe-6 | 81.93±0.28 | 85.12±0.26 | 84.85±0.26 | **86.19±0.25** |
| Average | 84.85 | 88.67 | 88.69 | 89.38 |

5. The average CR over the six TIMIT-4/6 data sets is determined by weighting the CR of each data set with the number of samples in its test set. These values are accumulated and normalized by the total number of samples in all test sets.

6. The SVM uses two parameters C∗ and σ, where C∗ is the penalty parameter for the errors in the non-separable case and σ is the variance parameter for the RBF kernel.

The classification rate improves for more complex structures when using ML parameter learning. Discriminatively optimized structures, i.e. TAN-OMI-CR and TAN-CR, significantly outperform generatively learned structures, i.e. TAN-CMI and NB. Discriminative parameter learning produces a significantly better classification performance than ML parameter learning on the same classifier structure. This holds especially for cases where the structure of the underlying model is not optimized for classification [10], i.e. NB and TAN-CMI.

MM parameter optimization outperforms CL learning

for most data sets. However, SVMs outperform our

discriminative Bayesian network classifiers on all data

sets. For TIMIT-4/6 one reason might be that SVMs are

applied to the continuous feature domain. In Table 4

we compare the model complexity, i.e. the number of

parameters, between SVMs and the best performing

Bayesian network classifier. This table reveals that the

Bayesian network uses ∼ 108, ∼ 66, ∼ 212, and ∼ 259

times fewer parameters than the SVM for MNIST, USPS,

Ma+Fe-4, and Ma+Fe-6, respectively. It is a well-known

fact that the number of support vectors in classical SVMs

increases linearly with the number of training sam-

ples [8]. In contrast, the structure of Bayesian network

classifiers naturally limits the number of parameters. A

substantial difference is that SVMs determine the num-

ber of support vectors automatically while in the case of

Bayesian networks the number of parameters is given

by the cardinality of the variables and the structure. In

this way, the model complexity can be easily controlled

by constraints on the structure. We use cross-tuning to

select C∗ and σ for SVMs, and the parameters λ and κ and the

number of CG iterations for MM learning of Bayesian

networks.
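The parameter ratios quoted above can be reproduced from the counts in Table 4. The short Python sketch below assumes that the number of SVM parameters is the number of support vectors times the number of features N. This counting rule is inferred from the table entries (the product matches every row) and is not stated explicitly in the text. The names `table4` and `svm_param_count` are illustrative only.

```python
# Rows of Table 4: (data set, N features, support vectors, BN parameters).
table4 = [
    ("MNIST", 196, 17201, 31149),
    ("USPS", 256, 3837, 14689),
    ("TIMIT-4/6 (Ma+Fe-4)", 20, 13146, 1239),
    ("TIMIT-4/6 (Ma+Fe-6)", 20, 24350, 1877),
    ("Washington D.C. Mall", 191, 11934, 62566),
]


def svm_param_count(n_features, n_support_vectors):
    """Assumed counting rule: one stored feature vector per support vector."""
    return n_features * n_support_vectors


# Integer ratio of SVM parameters to BN parameters for each data set.
ratios = {
    name: svm_param_count(n, n_sv) // n_bn
    for name, n, n_sv, n_bn in table4
}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio} times fewer BN parameters")
```

The integer ratios agree with the factors reported in the text for all five data sets.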

In contrast to SVMs, the Bayesian network structures used here are probabilistic generative models, even when discriminatively learned. They might be preferred since it is easy to work with missing features, domain knowledge can be directly incorporated into the graph structure, and it is easy to work with structured data. In the following, we demonstrate that a discriminatively optimized generative model still offers its advantages in the missing feature case. Our MM parameter learning keeps the sum-to-one constraint of the probability distributions. Therefore, we suggest, similarly to the generatively optimized models, to sum over the missing feature values. The interpretation of marginalizing over missing features is delicate since the discriminatively optimized parameters might not have anything in common with consistently estimated probabilities (e.g. maximum likelihood estimates). However, at least empirically there is strong support for using the density $P(C, X') = \sum_{X_{1:N} \setminus X'} P(C, X_{1:N})$, where $X'$ is a subset of the features $X_{1:N}$. This computation is tractable if the complexity class of $P(C, X_{1:N})$ is limited (e.g. 1-tree) and the variable order in the summation is chosen appropriately. In contrast, classical discriminative models are inherently conditional and it is not possible to obtain $p(C|X')$ from $p(C|X_{1:N})$. In particular, this holds for SVMs, logistic regression, and multi-layered perceptrons. These models commonly require imputation techniques to first complete missing feature values in the data. Then the classification approach is applied to the completed data.

We are particularly interested in the case where arbitrary sets of missing features can occur for each classification sample during testing (see footnote 7). In such a case, it is not possible to re-train the model for each potential set of missing features without also memorizing the training set. In Figure 3(a), we present the classification

performance of discriminative and generative structures

using ML parameter learning on the MNIST data as-

suming missing features. The x-axis denotes the number

of missing features. The curves are the average over

100 classifications of the test data with uniformly at

random selected missing features. We use exactly the

same missing features for each classifier. We observe that

discriminatively structured Bayesian network classifiers

outperform TAN-CMI-ML even in the case of missing

features. This demonstrates, at least empirically, that

discriminatively structured generative models do not

lose their ability to impute missing features.

In Figure 3(b), we show for the same data set and

experimental setup that the classification performance

of a discriminatively parameterized NB classifier may

be superior to a generatively parameterized NB model

in the case of missing features. In particular, this advan-

tage holds for up to ∼80 missing features. For a larger

number of missing features the performance of NB-ML

is more robust. Additionally, NB-CG-MM seems to be

more robust to an increasing number of missing features

compared to NB-CG-CL. This can be attributed to the

better generalization property of a margin-optimized

classifier.

5.2 Remote Sensing

We use a hyperspectral remote sensing image of the Washington D.C. Mall area containing 191 spectral bands with a spectral width of 5-10 nm (see footnote 8). As ground reference, a classification performed at Purdue University was used, containing 7 classes, namely roof, road, grass, trees, trail, water, and shadow (see footnote 9). The aerial image using bands 63, 52, and 36 for red, green, and blue colors, respectively, and the reference image are shown in Figure 4(a) and (b). The image contains 1280 × 307 hyperspectral pixels, i.e. 392960 samples. We arbitrarily choose 5000 samples of each class to learn the classifier. This remote sensing application is particularly interesting for our classifiers since spectral bands might be missing or should be neglected due to atmospheric effects. For example, radiation within the visible range should be neglected in case of clouds or darkness.

7. Note that we do not consider missing features during training of

the classifiers.

8. http://cobweb.ecn.purdue.edu/~biehl/MultiSpec/hyperspectral.html

9. http://cobweb.ecn.purdue.edu/~landgreb/Hyperspectral.Ex.html


TABLE 4

Model complexity for best Bayesian network (BN) and SVM.

| Data | N | Number of SVs | Number of SVM parameters | Number of BN parameters |
|---|---|---|---|---|
| MNIST | 196 | 17201 | 3371396 | 31149 |
| USPS | 256 | 3837 | 982272 | 14689 |
| TIMIT-4/6 (Ma+Fe-4) | 20 | 13146 | 262920 | 1239 |
| TIMIT-4/6 (Ma+Fe-6) | 20 | 24350 | 487000 | 1877 |
| Washington D.C. Mall | 191 | 11934 | 2279394 | 62566 |

Fig. 3. Classification performance on MNIST assuming missing features. The x-axis denotes the number of missing features and the shaded regions correspond to the standard deviation over 100 classifications: (a) Different structure learning methods with generative parameterization; (b) Different discriminative parameter learning methods on NB structure.

We apply the generative and discriminative parameter learning algorithms introduced above to the NB network structure. The classification performances are shown in Table 5.

Remarkably, NB-CG-MM slightly outperforms SVMs in this experiment. Additionally, the Bayesian network employs ∼ 36 times fewer parameters than the SVM (see Table 4). Figure 5 shows the influence of parameter κ in the loss function (6) for λ = 0.02 on the classification performance. The classification rate slightly improves for κ = 0.5. However, the impact is moderate. Note that the selection of κ is based on the cross-validation performance on the training data.

Fig. 4. Washington D.C. Mall: (a) Pseudo color image of spectral bands 63, 52, and 36; (b) Reference image.

TABLE 5

Classification results in [%] for Washington D.C. Mall data with standard deviation. Best parameter learning result is emphasized using bold font.

| NB, ML | NB, CG-CL | NB, CG-MM | SVM (C∗ = 1, σ = 0.05) |
|---|---|---|---|
| 81.07±0.06 | 87.01±0.05 | **89.34±0.05** | 88.98±0.05 |

Fig. 5. Influence of parameter κ of the loss function on the classification rate for λ = 0.02. [Plot: x-axis κ from 0 to 1; y-axis Classification Performance [%] from 88.6 to 89.4.]

As for MNIST in Section 5.1, we report classification results for NB-ML, NB-CG-MM, and NB-CG-CL assuming features missing at random during classification in Figure 6. The x-axis denotes the number of missing features. We average the performances over 100 classifications of the test data with randomly missing features. The standard deviation indicates that the resulting differences are significant for a moderate number of missing features. Discriminatively parameterized NB classifiers outperform NB-ML in the case of up to 150 missing features. Furthermore, we present results for SVMs where imputation methods are first used to complete missing feature values in the data. Afterwards, SVMs are applied to the completed data. In particular, we use two imputation approaches: (i) mean value imputation (the missing value is replaced with the mean value of the feature in the training data set); (ii) kNN value imputation (the missing value is replaced with the mean value, or for discretized data the most frequent value, of the k nearest neighbors). The neighbors of a sample with missing features are determined by the Euclidean distance in the relevant subspace. In the special case where k equals the number of training instances M this method is identical to mean value imputation. We use k = 5. As shown in Figure 6, mean value imputation significantly degrades the classification performance of SVMs in case of missing features. Handling missing features with NB classifiers is easy since we can simply neglect the conditional probability of the missing feature Zj in Eq. (1), i.e. the joint probability is the product of the factors of the available features only.
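The following Python sketch illustrates this for a toy discrete NB model. All probability tables are randomly generated and the function names are hypothetical, so this is not the authors' implementation. It checks that dropping the factors of the missing features gives the same joint as explicitly summing the full joint over all values of the missing features, which holds because every conditional probability table is normalized.

```python
import itertools

import numpy as np


def nb_joint(prior, cpts, x, missing=frozenset()):
    """Return P(c, x') for every class c, skipping the factors of missing features.

    prior: shape (C,); cpts[j]: shape (C, K_j) with rows P(X_j | c); x: feature values.
    """
    joint = prior.copy()
    for j, cpt in enumerate(cpts):
        if j not in missing:
            joint = joint * cpt[:, x[j]]
    return joint


def nb_joint_brute_force(prior, cpts, x, missing):
    """Reference: sum the full joint over all values of the missing features."""
    missing = sorted(missing)
    total = np.zeros_like(prior)
    ranges = [range(cpts[j].shape[1]) for j in missing]
    for values in itertools.product(*ranges):
        x_full = list(x)
        for j, v in zip(missing, values):
            x_full[j] = v
        total += nb_joint(prior, cpts, x_full)
    return total


rng = np.random.default_rng(0)
prior = rng.dirichlet(np.ones(3))                              # toy P(C), 3 classes
cpts = [rng.dirichlet(np.ones(k), size=3) for k in (4, 2, 5)]  # toy P(X_j | C)
x = [1, 0, 3]          # values at the missing positions are ignored
missing = {0, 2}       # indices of the missing features

fast = nb_joint(prior, cpts, x, missing)
slow = nb_joint_brute_force(prior, cpts, x, missing)
predicted_class = int(np.argmax(fast))
```

The equality of `fast` and `slow` relies on the sum-to-one constraint being kept during discriminative training. With unnormalized parameters the dropped factors would not sum to one and the shortcut would no longer be a marginalization.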

Fig. 6. Washington D.C. Mall: Classification results for

NB-ML, NB-CG-MM, NB-CG-CL, and SVMs (using mean

value imputation) assuming missing features.

Figure 7 shows kNN value imputation results for SVMs and NB-CG-MM. kNN feature value imputation is slow and requires the training data during classification of samples with randomly missing features. However, it provides more information to the classifier than simple summation over the missing feature values, as shown for the NB-CG-MM case.
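The two imputation schemes described above can be sketched in a few lines of Python. This is an illustrative version for continuous features on random toy data. The function names and the data are assumptions, and the most-frequent-value variant for discretized data is not shown.

```python
import numpy as np


def mean_impute(x, missing, train):
    """Replace the missing entries of x by the training-set mean of each feature."""
    x = np.array(x, dtype=float)
    x[missing] = train[:, missing].mean(axis=0)
    return x


def knn_impute(x, missing, train, k=5):
    """Replace missing entries by the mean over the k nearest training samples.

    Neighbors are found by Euclidean distance on the observed features only.
    """
    x = np.array(x, dtype=float)
    observed = np.setdiff1d(np.arange(x.size), missing)
    dist = np.linalg.norm(train[:, observed] - x[observed], axis=1)
    nearest = np.argsort(dist)[:k]
    x[missing] = train[nearest][:, missing].mean(axis=0)
    return x


rng = np.random.default_rng(0)
train = rng.normal(size=(50, 6))   # toy training set: M = 50 samples, 6 features
sample = rng.normal(size=6)
missing = [1, 4]                   # indices of the features treated as missing

x_knn = knn_impute(sample, missing, train, k=5)
x_all = knn_impute(sample, missing, train, k=len(train))
x_mean = mean_impute(sample, missing, train)
```

With k equal to the number of training samples, `x_all` coincides with `x_mean`, which is the special case noted in the text. The sketch also makes the cost visible: every test sample needs a distance computation against the whole training set.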

Fig. 7. Washington D.C. Mall: Classification results for

NB-CG-MM (summation over missing feature values),

NB-CG-MM (using kNN most frequent value imputation),

and SVMs (using kNN mean value imputation) assuming

missing features.

5.3 Margin Optimization using Convex Relaxation

In this section, we compare our CG-based margin opti-

mization to a recently proposed approach using convex

relaxation [1] in terms of classification accuracy and

computational efficiency. First we provide a short intro-

duction to convex relaxation for margin maximization

and give details on solving the convex problem for our

data. Unfortunately, Guo et al. [1] only provided results

on small-scale experiments, i.e. 50 samples and up to 36

features.

5.3.1 Background

Guo et al. [1] proposed to solve the maximum margin parameter learning problem for Bayesian network classifiers by reformulating it as a convex optimization problem. They introduced the parameter vector $w$ with elements $w^j_{i|h} = \log(\theta^j_{i|h})$ (in some order) and, using the same order for the elements, the feature vectors $\phi(z^m)$ with elements $u^{j,m}_{i|h}$. Then, the probability of sample $z^m$ can be written as $P_\Theta(Z = z^m) = \exp(\phi(z^m)^T w)$, where $\phi(z^m)^T$ denotes the transpose of $\phi(z^m)$. The logarithm of the multi-class margin (3) of the m-th sample becomes

$$\log d^m_\Theta = \min_{c \neq c^m} \left[\phi(c^m, x^m_{1:N}) - \phi(c, x^m_{1:N})\right]^T w.$$

In this way, the problem of learning the maximum margin parameters of the Bayesian network can be recast as

$$\begin{aligned}
\underset{\gamma, w}{\text{maximize}} \quad & \gamma \\
\text{s.t.} \quad & \Delta_{m,c}\, w \geq \gamma, && \forall m \text{ and } c \neq c^m, \\
& \gamma \geq 0, \\
& \sum_{i=1}^{|Z_j|} \exp(w^j_{i|h}) = 1, && \forall j, h,
\end{aligned}$$

where $\gamma$ is the logarithm of the minimum of all sample margins and $\Delta_{m,c} = \left[\phi(c^m, x^m_{1:N}) - \phi(c, x^m_{1:N})\right]^T$. The first constraint ensures that all sample margins are greater than $\gamma$ and the third constraint that $w$ parameterizes a valid Bayesian network, i.e. $w$ describes valid probability distributions.

Finally, by introducing one slack variable $\epsilon_m$ for each sample $z^m$, relaxing the constraints on the parameter vector $w$ and rewriting the objective function, Guo et al. derived the optimization problem

$$\begin{aligned}
\underset{\gamma, w, \epsilon_1, \ldots, \epsilon_M}{\text{minimize}} \quad & \frac{1}{2\gamma^2} + B \sum_{m=1}^{M} \epsilon_m \\
\text{s.t.} \quad & \Delta_{m,c}\, w \geq \gamma - \epsilon_m, && \forall m \text{ and } c \neq c^m, \\
& \epsilon_m \geq 0, && \forall m, \\
& \gamma \geq 0, \\
& \sum_{i=1}^{|Z_j|} \exp(w^j_{i|h}) \leq 1, && \forall j, h,
\end{aligned} \tag{9}$$

for determining the maximum margin parameters. The

parameter B can be used to control the slack effect

(similar to parameter C in SVMs). The above problem is

convex with convex inequality constraints. Hence, any

local minimum is also a global minimum. Furthermore,

under certain conditions on the structure of the Bayesian

network, the (typically) subnormalized parameter vector

w of a solution allows for re-normalization without

changing the decision function P(c|x1:N) (see [4], [5] for

details).

There are many possibilities to solve the optimization problem in Eq. (9). Any minimization method allowing for a nonlinear objective function and nonlinear convex inequality constraints can be used in principle. We decided to use the large scale solver IPOPT [40], which shows good performance in several applications (see e.g. [41]; implementation details are given in footnote 10). IPOPT applies an interior-point method [43] to solve the problem in (9). It requires the objective function, its gradient, the constraint functions, the Jacobian of the constraint functions, and the second derivative of the Lagrangian function. To ensure short runtimes and good results we used an adaptive strategy for the barrier parameter, and let the algorithm run for up to 100 iterations or until sufficient precision was achieved (see footnote 11). We refer to solutions obtained by IPOPT as CVX-MM.

5.3.2 Experimental Comparison

Table 6 and Table 7 show the classification rates and runtimes, respectively, for the different algorithms and data sets. The classification rates of CVX-MM are slightly better than those of CG-MM on all data sets except USPS, while the proposed algorithm CG-MM (see footnote 12) is up to orders of magnitude faster.

10. We used IPOPT 3.9.2 in conjunction with MUMPS 4.9.2 [42], a parallel sparse direct solver. IPOPT was compiled with Lapack 3.2.1 and BLAS from the Netlib repository (version from March 2007). IPOPT is typically faster than the function fmincon of MATLAB for this type of optimization problem.

11. Good classifiers do not require highly accurate solutions. Hence, the tolerance for the objective function is set to $10^{-1}$, which reduces the runtime of the algorithm compared to using default tolerance settings.

For USPS the training data is separable with large

margin by a NB classifier, i.e. there exists a probability

distribution that factors according to a NB network

for which all samples in the training set are classified

correctly and for which samples from different classes

are separated by a large margin. Therefore, an optimal

solution of (9) has small objective for a large range of the

parameter B. This complicates the choice of B as well

as the tolerance settings for the interior-point optimizer

(the optimization problem has to be solved with high

precision while keeping the optimization tractable). In

our experiments we were not able to find a setting

such that the achieved classification rate on the test set

is larger than 90.90% which is much smaller than the

classification rate of CG-MM.

The large computational requirements of CVX-MM are

caused by the convex formulation in Eq. (9): there is

one inequality for each conditional probability of the

network and for every additional training sample the

number of inequalities increases by |C|, i.e. the number

of classes. Further, there is an additional slack variable

resulting in an increase of the dimension of the search

space. The used test sets were the same as described

above. The runtime experiments were performed on

a personal computer with 2.8 GHz CPUs, 16 GB of

memory, not exploiting any (multicore) parallelization.

Furthermore, we fixed the regularization parameter B

to 1 for the MNIST data due to time constraints, but

selected it using cross-tuning for the TIMIT-4/6 and

USPS data.

TABLE 6

Classification rate (CR) in [%] for different data using a naive Bayes classifier.

| Data | ML: CR | CG-MM: CR | CVX-MM: CR | CVX-MM: B |
|---|---|---|---|---|
| MNIST | 83.73 | 91.82 | 92.04 | 1 |
| USPS | 87.10 | 95.23 | 90.90 | 1 |
| Ma+Fe-4 | 87.90 | 92.09 | 92.31 | 3.9·10^-3 |
| Ma-4 | 88.69 | 92.97 | 93.09 | 3.9·10^-3 |
| Fe-4 | 87.67 | 91.57 | 91.82 | 3.9·10^-3 |
| Ma+Fe-6 | 81.82 | 85.43 | 85.61 | 3.9·10^-3 |
| Ma-6 | 82.26 | 86.20 | 86.67 | 3.9·10^-3 |
| Fe-6 | 81.93 | 84.85 | 85.46 | 3.9·10^-3 |

Convex relaxation for margin optimization is interesting due to its sound theoretical background. However, without further algorithmic developments its practical application seems to be limited to problems with only few training samples and a low number of features. In contrast, the proposed method for maximum margin parameter learning can deal with large sets of training data efficiently and achieves comparable classification rates. Furthermore, we observed superior runtime performance of the proposed method in the conducted experiments.

12. CG-MM is implemented in MATLAB.

TABLE 7

Runtimes in [s] for different data using a naive Bayes classifier (B as in Table 6).

| Data | CG-MM | CVX-MM |
|---|---|---|
| MNIST | 833 | 54 hours |
| USPS | 113 | 21 hours |
| Ma+Fe-4 | 391 | 1338 |
| Ma-4 | 168 | 842 |
| Fe-4 | 87 | 844 |
| Ma+Fe-6 | 202 | 4566 |
| Ma-6 | 241 | 3505 |
| Fe-6 | 108 | 3002 |

6 CONCLUSION

We derived a discriminative parameter learning algo-

rithm for Bayesian network classifiers based on maximiz-

ing the margin. For margin optimization we introduced

a conjugate gradient algorithm. In contrast to previous

work on margin optimization in probabilistic models,

we kept the sum-to-one constraint which maintains the

probabilistic interpretation of the network, e.g. sum-

mation over missing variables is still possible. In the

experiments, we treat two cases of discriminative param-

eter learning – both optimization criteria (CL or MM)

were optimized with the CG method. Furthermore, we

applied various parameter learning algorithms on naive

Bayes and generatively and discriminatively optimized

TAN structures. Discriminative parameter learning sig-

nificantly outperforms ML parameter estimation. Fur-

thermore, maximizing the margin slightly improves the

classification performance compared to CL parameter

optimization in most cases.

Additionally, we provided empirical results for a max-

imum margin optimization approach based on convex

relaxation. The classification results of both maximum

margin parameter learning approaches are almost iden-

tical, whereas the computational requirements of our

CG-based optimization are up to orders of magnitude

lower. Margin-optimized Bayesian networks perform on

par with SVMs in terms of classification rate, however

the Bayesian network classifiers require fewer parame-

ters than the SVM and can directly deal with missing

features, a case where discriminative classifiers usually

require imputation techniques.

ACKNOWLEDGMENTS

The authors thank the anonymous reviewers for use-

ful comments that improved the quality of the paper.

Thanks to Jeff Bilmes for discussions and support in

writing this paper.

APPENDIX: CL PARAMETER LEARNING

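The CG algorithm relies on the gradient of the objective function given as

$$\frac{\partial CLL(B|S)}{\partial \theta^j_{i|h}} = \sum_{m=1}^{M} \left[ \frac{\partial}{\partial \theta^j_{i|h}} \log P_\Theta(c^m, x^m_{1:N}) - \frac{\sum_{c=1}^{|C|} \frac{\partial}{\partial \theta^j_{i|h}} P_\Theta(c, x^m_{1:N})}{\sum_{c=1}^{|C|} P_\Theta(c, x^m_{1:N})} \right].$$

Similar to Section 3.3, we distinguish two cases for differentiating $CLL(B|S)$, i.e. either $C = i$ for $j = 1$ (Case 1) or $C = h_1$ for $j > 1$ (Case 2).

Case 1: For the class variable, i.e. $j = 1$ and $h = \emptyset$, we get

$$\frac{\partial CLL(B|S)}{\partial \theta^1_i} = \sum_{m=1}^{M} \left[ \frac{u^{1,m}_i}{\theta^1_i} - \frac{W^m_i}{\theta^1_i} \right],$$

using Eq. (1) for differentiating the first term (omitting the sum over $j$ and $h$) and we introduced the class posterior $W^m_i = P_\Theta(i|x^m_{1:N})$ as

$$W^m_i = \frac{P_\Theta(i, x^m_{1:N})}{\sum_{c=1}^{|C|} P_\Theta(c, x^m_{1:N})}.$$

Case 2: For the attribute variables, i.e. $j > 1$, we differentiate correspondingly and have

$$\frac{\partial CLL(B|S)}{\partial \theta^j_{i|h}} = \sum_{m=1}^{M} \left[ \frac{u^{j,m}_{i|h}}{\theta^j_{i|h}} - W^m_{h_1} \frac{v^{j,m}_{i|h \setminus h_1}}{\theta^j_{i|h}} \right],$$

where $W^m_{h_1} = P_\Theta(h_1|x^m_{1:N})$ is the posterior for class $h_1$ and sample $m$ and $v^{j,m}_{i|h \setminus h_1} = \mathbb{1}_{\{z^m_j = i \text{ and } z^m_{\Pi_j} = h \setminus h_1\}}$.

The conditional log likelihood given in Eq. (2) can be optimized by a conjugate gradient algorithm using line-search in a similar manner as given in Section 3.2. Again, we re-parameterize the problem to incorporate the constraints on $\theta^j_{i|h}$ in the conjugate gradient algorithm. This requires the gradient of $CLL(B|S)$ with respect to $\beta^j_{i|h}$ which is computed using the chain rule as

$$\frac{\partial CLL(B|S)}{\partial \beta^1_i} = \sum_{k=1}^{|Z_j|} \frac{\partial CLL(B|S)}{\partial \theta^1_k} \frac{\partial \theta^1_k}{\partial \beta^1_i} = \sum_{m=1}^{M} \left[ u^{1,m}_i - W^m_i \right] - \theta^1_i \sum_{m=1}^{M} \sum_{c=1}^{|C|} \left[ u^{1,m}_c - W^m_c \right]$$

for Case 1. Similarly for Case 2, the gradient is

$$\frac{\partial CLL(B|S)}{\partial \beta^j_{i|h}} = \sum_{m=1}^{M} \left[ u^{j,m}_{i|h} - W^m_{h_1} v^{j,m}_{i|h \setminus h_1} \right] - \theta^j_{i|h} \sum_{m=1}^{M} \sum_{l=1}^{|Z_j|} \left[ u^{j,m}_{l|h} - W^m_{h_1} v^{j,m}_{l|h \setminus h_1} \right].$$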
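As a sanity check of the Case 1 gradient, the Python sketch below compares the closed-form expression with a central finite-difference approximation on a toy NB model. It assumes that the re-parameterization of Section 3.2 is the softmax map $\theta^1_i = \exp(\beta^1_i) / \sum_k \exp(\beta^1_k)$. That section is not reproduced here, so this is an assumption, although it is consistent with the chain-rule result above. The feature likelihoods are random placeholders and all function names are illustrative.

```python
import numpy as np


def softmax(beta):
    e = np.exp(beta - beta.max())
    return e / e.sum()


def cll(beta, lik, labels):
    """Conditional log likelihood as a function of the class-prior logits beta.

    lik[m, c] is the fixed product of feature likelihoods P(x^m | c).
    """
    joint = softmax(beta) * lik                      # P(c, x^m)
    post = joint / joint.sum(axis=1, keepdims=True)  # P(c | x^m)
    return np.log(post[np.arange(len(labels)), labels]).sum()


def cll_grad_beta(beta, lik, labels):
    """Closed-form Case 1 gradient with respect to beta."""
    theta = softmax(beta)
    joint = theta * lik
    W = joint / joint.sum(axis=1, keepdims=True)     # class posteriors W^m_c
    U = np.eye(len(beta))[labels]                    # indicators u^{1,m}_c
    diff = (U - W).sum(axis=0)                       # sum over m of [u - W]
    return diff - theta * diff.sum()                 # second term is ~0 here


def numeric_grad(f, beta, eps=1e-6):
    """Central finite differences, used only for verification."""
    grad = np.zeros_like(beta)
    for i in range(beta.size):
        step = np.zeros_like(beta)
        step[i] = eps
        grad[i] = (f(beta + step) - f(beta - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
lik = rng.uniform(0.05, 1.0, size=(8, 3))   # toy likelihoods: 8 samples, 3 classes
labels = rng.integers(0, 3, size=8)
beta = rng.normal(size=3)

analytic = cll_grad_beta(beta, lik, labels)
numeric = numeric_grad(lambda b: cll(b, lik, labels), beta)
```

For the class variable the inner sum over classes of $u^{1,m}_c - W^m_c$ is zero for every sample, since both the indicators and the posteriors sum to one, so the second term only matters numerically in Case 2.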


REFERENCES

[1] Y. Guo, D. Wilkinson, and D. Schuurmans, “Maximum margin Bayesian networks,” in International Conference on Uncertainty in Artificial Intelligence (UAI), 2005, pp. 233–242.

[2] V. Vapnik, Statistical learning theory. Wiley & Sons, 1998.

[3] B. Taskar, C. Guestrin, and D. Koller, “Max-margin Markov networks,” in Advances in Neural Information Processing Systems (NIPS), 2003.

[4] H. Wettig, P. Grünwald, T. Roos, P. Myllymäki, and H. Tirri, “When discriminative learning of Bayesian network parameters is easy,” in International Joint Conference on Artificial Intelligence (IJCAI), 2003, pp. 491–496.

[5] T. Roos, H. Wettig, P. Grünwald, P. Myllymäki, and H. Tirri, “On discriminative Bayesian network classifiers and logistic regression,” Machine Learning, vol. 59, pp. 267–296, 2005.

[6] F. Sha and L. Saul, “Comparison of large margin training to other discriminative methods for phonetic recognition by hidden Markov models,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2007, pp. 313–316.

[7] G. Heigold, T. Deselaers, R. Schlüter, and H. Ney, “Modified MMI/MPE: A direct evaluation of the margin in speech recognition,” in International Conference on Machine Learning (ICML), 2008, pp. 384–391.

[8] R. Collobert, F. Sinz, J. Weston, and L. Bottou, “Trading convexity for scalability,” in International Conference on Machine Learning (ICML), 2006, pp. 201–208.

[9] C. Bishop, Neural networks for pattern recognition. Oxford University Press, 1995.

[10] R. Greiner, X. Su, S. Shen, and W. Zhou, “Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers,” Machine Learning, vol. 59, pp. 297–322, 2005.

[11] O. Gopalakrishnan, D. Kanevsky, A. Nádas, and D. Nahamoo, “An inequality for rational functions with applications to some statistical estimation problems,” IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 107–113, 1991.

[12] F. Pernkopf and M. Wohlmayr, “On discriminative parameter learning of Bayesian network classifiers,” in European Conference on Machine Learning (ECML), 2009, pp. 221–237.

[13] P. Woodland and D. Povey, “Large scale discriminative training of hidden Markov models for speech recognition,” Computer Speech and Language, vol. 16, pp. 25–47, 2002.

[14] R. Schlüter, W. Macherey, M. B., and H. Ney, “Comparison of discriminative training criteria and optimization methods for speech recognition,” Speech Communication, vol. 34, pp. 287–310, 2001.

[15] F. Pernkopf and M. Wohlmayr, “Maximum margin Bayesian network classifiers,” Institute of Signal Processing and Speech Communication, Graz University of Technology, Tech. Rep., 2010.

[16] F. Pernkopf and M. Wohlmayr, “Large margin learning of Bayesian classifiers based on Gaussian mixture models,” in European Conference on Machine Learning (ECML), 2010, pp. 50–66.

[17] L. Lamel, R. Kassel, and S. Seneff, “Speech database development: Design and analysis of the acoustic-phonetic corpus,” in DARPA Speech Recognition Workshop, Report No. SAIC-86/1546, 1986.

[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[19] J. Pearl, Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann, 1988.

[20] F. Pernkopf and J. Bilmes, “Efficient heuristics for discriminative structure learning of Bayesian network classifiers,” Journal of Machine Learning Research, vol. 11, pp. 2323–2360, 2010.

[21] N. Friedman, D. Geiger, and M. Goldszmidt, “Bayesian network classifiers,” Machine Learning, vol. 29, pp. 131–163, 1997.

[22] P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Machine Learning, vol. 29, no. 2-3, pp. 103–130, 1997.

[23] J. Bilmes, “Dynamic Bayesian multinets,” in 16th Inter. Conf. of Uncertainty in Artificial Intelligence (UAI), 2000, pp. 38–45.

[24] R. Cowell, A. Dawid, S. Lauritzen, and D. Spiegelhalter, Probabilistic networks and expert systems. Springer Verlag, 1999.

[25] C. Bishop, Pattern recognition and machine learning. Springer, 2006.

[26] S. Acid, L. de Campos, and J. Castellano, “Learning Bayesian network classifiers: Searching in a space of partially directed acyclic graphs,” Machine Learning, vol. 59, pp. 213–235, 2005.

[27] B. Schölkopf and A. Smola, Learning with kernels: Support Vector

Machines, regularization, optimization, and beyond. MIT Press, 2001.

[28] P. Huber, “Robust estimation of a location parameter,” Annals of

Mathematical Statistics, vol. 35, no. 1, pp. 73–101, 1964.

[29] O. Chapelle, “Training a support vector machine in the primal,”

Neural Computation, vol. 19, no. 5, pp. 1155–1178, 2007.

[30] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical

recipes in C.Cambridge Univ. Press, 1992.

[31] T. Cover and J. Thomas, Elements of information theory. John Wiley

& Sons, 1991.

[32] E. Keogh and M. Pazzani, “Learning augmented Bayesian classi-

fiers: A comparison of distribution-based and classification-based

approaches,” in Workshop on Artificial Intelligence and Statistics,

1999, pp. 225–230.

[33] F. Pernkopf, “Bayesian network classifiers versus selective k-NN

classifier,” Pattern Recognition, vol. 38, no. 3, pp. 1–10, 2005.

[34] D. Grossman and P. Domingos, “Learning Bayesian network

classifiers by maximizing conditional likelihood,” in Inter. Conf.

of Machine Learning (ICML), 2004, pp. 361–368.

[35] P. Bartlett, M. Jordan, and J. McAuliffe, “Convexity, classification,

and risk bounds,” Journal of the American Statistical Association, vol.

101, no. 473, pp. 138–156, 2006.

[36] F. Pernkopf and M. Wohlmayr, “Stochastic margin-based structure

learning of Bayesian network classifiers,” Laboratory of Signal

Processing and Speech Communication, Graz University of Tech-

nology, Tech. Rep., 2011.

[37] F. Pernkopf and J. Bilmes, “Order-based discriminative structure

learning for Bayesian network classifiers,” in International Sympo-

sium on Artificial Intelligence and Mathematics, 2008.

[38] U. Fayyad and K. Irani, “Multi-interval discretization of continuous-valued attributes for classification learning,” in Joint Conf. on Artificial Intelligence, 1993, pp. 1022–1027.

[39] F. Pernkopf, T. Van Pham, and J. Bilmes, “Broad phonetic classification using discriminative Bayesian networks,” Speech Communication, vol. 143, no. 1, pp. 123–138, 2008.

[40] A. Wächter and L. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” Mathematical Programming, vol. 106, pp. 25–57, 2006.

[41] L. Biegler and V. Zavala, “Large-scale nonlinear programming using IPOPT: An integrating framework for enterprise-wide dynamic optimization,” Computers & Chemical Engineering, vol. 33, no. 3, pp. 575–582, 2009.

[42] P. Amestoy, I. Duff, J.-Y. L’Excellent, and J. Koster, “MUMPS: A general purpose distributed memory sparse solver,” in 5th International Workshop on Applied Parallel Computing. Springer-Verlag, 2000, pp. 122–131.

[43] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, March 2004.

Franz Pernkopf received his MSc (Dipl. Ing.)

degree in Electrical Engineering at Graz Uni-

versity of Technology, Austria, in summer 1999.

He earned a PhD degree from the University

of Leoben, Austria, in 2002. In 2002 he was

awarded the Erwin Schrödinger Fellowship. He

was a Research Associate in the Department of

Electrical Engineering at the University of Wash-

ington, Seattle, from 2004 to 2006. Currently,

he is Associate Professor at the Laboratory of

Signal Processing and Speech Communication,

Graz University of Technology, Austria. His research interests include

machine learning, discriminative learning, graphical models, feature

selection, finite mixture models, and image- and speech processing

applications.


Michael Wohlmayr graduated from Graz Uni-

versity of Technology in June 2007. He con-

ducted his Master thesis in collaboration with

University of Crete, Greece. Since October

2007, he has been pursuing the PhD degree at the

Signal Processing and Speech Communication

Laboratory at Graz University of Technology. His

research interests include Bayesian networks,

speech and audio analysis, as well as statistical

pattern recognition.

Sebastian Tschiatschek received the BSc de-

gree and MSc degree (with distinction) in Elec-

trical Engineering at Graz University of Tech-

nology (TUG) in 2007 and 2010, respectively.

He conducted his Master thesis during a one-

year stay at ETH Zürich, Switzerland. Currently,

he is with the Signal Processing and Speech

Communication Laboratory at TUG where he is

pursuing the PhD degree. His research interests

include Bayesian networks, information theory in

conjunction with graphical models and statistical

pattern recognition.