Maximum Margin Bayesian Network Classifiers

Department of Electrical Engineering, Graz University of Technology, Graz, Austria
IEEE Transactions on Pattern Analysis and Machine Intelligence, 04/2012; 34(3):521-532. DOI: 10.1109/TPAMI.2011.149
Source: IEEE Xplore


We present a maximum margin parameter learning algorithm for Bayesian network classifiers using a conjugate gradient (CG) method for optimization. In contrast to previous approaches, we maintain the normalization constraints on the parameters of the Bayesian network during optimization, i.e., the probabilistic interpretation of the model is not lost. This enables us to handle missing features in discriminatively optimized Bayesian networks. In experiments, we compare the classification performance of maximum margin parameter learning to conditional likelihood and maximum likelihood learning approaches. Discriminative parameter learning significantly outperforms generative maximum likelihood estimation for naive Bayes and tree augmented naive Bayes structures on all considered data sets. Furthermore, maximizing the margin dominates the conditional likelihood approach in terms of classification performance in most cases. We provide results for a recently proposed maximum margin optimization approach based on convex relaxation [1]. While the classification results are highly similar, our CG-based optimization is computationally up to orders of magnitude faster. Margin-optimized Bayesian network classifiers achieve classification performance comparable to that of support vector machines (SVMs) while using fewer parameters. Moreover, we show that unanticipated missing feature values during classification can be easily handled by discriminatively optimized Bayesian network classifiers, a case in which discriminative classifiers usually require mechanisms to complete the unknown feature values in the data first.
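The abstract leaves the optimization details to the paper; the toy sketch below (not the authors' implementation) illustrates the core idea of margin training while keeping the parameters normalized: the class prior and conditional probability tables of a naive Bayes model are reparameterized through a softmax so that P(c, x) remains a proper joint distribution, and a hinge-type penalty on the log-probability margin is minimized with SciPy's nonlinear conjugate gradient routine. The squared-hinge objective, the toy data, and the finite-difference gradients are assumptions made for brevity.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

rng = np.random.default_rng(0)
C, D, V = 3, 5, 4                               # classes, features, values per feature (toy sizes)
true_pxc = rng.dirichlet(np.ones(V), size=(C, D))  # toy class-conditional distributions
y = rng.integers(0, C, size=200)
X = np.array([[rng.choice(V, p=true_pxc[c, d]) for d in range(D)] for c in y])

def unpack(w):
    return w[:C], w[C:].reshape(C, D, V)

def joint_log_scores(w, X):
    """log P(c, x) for every sample and class; normalization is enforced here,
    so the parameters always describe a proper joint distribution."""
    theta_c, theta_xc = unpack(w)
    log_pc = theta_c - logsumexp(theta_c)                            # log P(c)
    log_pxc = theta_xc - logsumexp(theta_xc, axis=2, keepdims=True)  # log P(x_d | c)
    scores = np.empty((X.shape[0], C))
    for c in range(C):
        scores[:, c] = log_pc[c] + log_pxc[c][np.arange(D), X].sum(axis=1)
    return scores

def margin_loss(w, X, y, gamma=1.0):
    """Squared hinge on the multiclass log-probability margin."""
    s = joint_log_scores(w, X)
    n = np.arange(len(y))
    true = s[n, y]
    s_others = s.copy()
    s_others[n, y] = -np.inf
    competitor = logsumexp(s_others, axis=1)     # soft maximum over the wrong classes
    return np.sum(np.maximum(0.0, gamma - (true - competitor)) ** 2)

w0 = np.zeros(C + C * D * V)                     # uniform initial distributions
# finite-difference gradients keep the sketch short; a practical implementation
# would supply analytic gradients to the CG routine
res = minimize(margin_loss, w0, args=(X, y), method='CG', options={'maxiter': 200})
pred = joint_log_scores(res.x, X).argmax(axis=1)
print("training accuracy:", (pred == y).mean())
```

Because the model stays normalized, a feature that is missing at classification time can simply be marginalized out, i.e., summed over its possible values, without any imputation step; this is the property the abstract highlights for handling unanticipated missing features.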



    • "The resulting algorithm, called stochastic discriminative EM (sdEM), is an online-EM-type algorithm that can train generative probabilistic models belonging to the exponential family using a wide range of discriminative loss functions , such as the negative conditional log-likelihood or the Hinge loss. In opposite to other discriminative learning approaches [26], models trained by sdEM can deal with missing data and latent variables in a principled way either when being learned or when making predictions, because at any moment they always define a joint probability distribution. sdEM could be used for learning using large scale data sets due to its stochastic approximation nature and, as we will show, because it allows to compute the natural gradient of the loss function with no extra cost [3]. "
    ABSTRACT: Stochastic discriminative EM (sdEM) is an online-EM-type algorithm for discriminative training of probabilistic generative models belonging to the exponential family. In this work, we introduce and justify this algorithm as a stochastic natural gradient descent method, i.e., a method that accounts for the information geometry of the parameter space of the statistical model. We show how this learning algorithm can be used to train probabilistic generative models by minimizing different discriminative loss functions, such as the negative conditional log-likelihood and the Hinge loss. The resulting models trained by sdEM are always generative (i.e., they define a joint probability distribution) and can therefore deal with missing data and latent variables in a principled way, both when being learned and when making predictions. The performance of this method is illustrated by several text classification problems for which a multinomial naive Bayes classifier and a latent Dirichlet allocation based classifier are learned using different discriminative loss functions.
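The following toy sketch conveys only the general flavor of discriminatively training a generative model online: plain stochastic gradient descent on the negative conditional log-likelihood of a multinomial naive Bayes text classifier, with a softmax reparameterization so the model stays a proper joint distribution. It is a deliberate simplification; sdEM itself performs natural-gradient updates in the exponential-family parameterization, which is not reproduced here, and the toy data and step size are assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def ncll_and_grad(theta_c, theta_xc, x, y):
    """Negative conditional log-likelihood and its gradient for one document.
    theta_c: (C,) class log-potentials, theta_xc: (C, W) word log-potentials,
    x: (W,) word-count vector, y: true class index.  The softmax
    parameterization keeps P(c) and P(w | c) normalized at all times."""
    n = x.sum()
    log_pc = theta_c - logsumexp(theta_c)
    log_pwc = theta_xc - logsumexp(theta_xc, axis=1, keepdims=True)
    scores = log_pc + log_pwc @ x                  # log P(c) + sum_w x_w log P(w | c)
    post = np.exp(scores - logsumexp(scores))      # P(c | x)
    r = post.copy()
    r[y] -= 1.0                                    # d NCLL / d scores
    g_c = r                                        # gradient w.r.t. theta_c
    g_xc = r[:, None] * (x[None, :] - n * np.exp(log_pwc))  # gradient w.r.t. theta_xc
    return -scores[y] + logsumexp(scores), g_c, g_xc

# toy word-count documents with some class structure
rng = np.random.default_rng(0)
C, W, N = 3, 50, 500
topic = rng.dirichlet(np.ones(W), size=C)
y = rng.integers(0, C, size=N)
X = np.array([rng.multinomial(30, topic[c]) for c in y], dtype=float)

theta_c, theta_xc = np.zeros(C), np.zeros((C, W))
lr = 0.05
for epoch in range(5):                             # online passes over the data
    for i in rng.permutation(N):
        _, g_c, g_xc = ncll_and_grad(theta_c, theta_xc, X[i], y[i])
        theta_c -= lr * g_c
        theta_xc -= lr * g_xc

scores = (theta_c - logsumexp(theta_c)) + X @ (theta_xc - logsumexp(theta_xc, axis=1, keepdims=True)).T
print("training accuracy:", (scores.argmax(axis=1) == y).mean())
```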
    • "a) Discriminatively versus generatively optimized parameters: Here, we compare the classification performance of BNCs with MAP parameters and of BNCs with MM parameters over varying numbers of bits used for quantization. MM parameters are determined using the algorithm described in [4]. The structures considered are NB and TAN-CMI. "
    ABSTRACT: Bayesian network classifiers (BNCs) are typically implemented on present-day desktop computers. However, many real-world applications require classifiers to run on embedded or low-power systems, and the implications of such implementations have not been studied rigorously. We partly close this gap by analyzing reduced-precision implementations of BNCs. In detail, we investigate the quantization of the parameters of BNCs with discrete-valued nodes, including the implications for the classification rate (CR). We derive worst-case and probabilistic bounds on the CR for different bit-widths. These bounds are evaluated on several benchmark datasets. Furthermore, we compare the classification performance and the robustness of BNCs with generatively and discriminatively optimized parameters, i.e., parameters optimized for high data likelihood and parameters optimized for classification, with respect to parameter quantization. Generatively optimized parameters are more robust for very low bit-widths, i.e., fewer classifications change because of quantization. However, classification performance is better for discriminatively optimized parameters at all but very low bit-widths. Additionally, we perform this analysis for margin-optimized tree augmented network (TAN) structures, which outperform generatively optimized TAN structures in terms of CR and robustness.
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 01/2014; DOI: 10.1109/TPAMI.2014.2353620
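As a rough illustration of the kind of reduced-precision experiment described in the abstract above, the sketch below quantizes the log-domain conditional probability tables of a toy naive Bayes classifier to different bit-widths and reports how often the quantized model agrees with the full-precision one. The uniform fixed-range quantizer and the toy maximum-likelihood tables are assumptions for illustration; the paper additionally derives analytic worst-case and probabilistic bounds rather than relying on simulation alone.

```python
import numpy as np

def quantize_log_probs(log_p, bits, lo=-10.0, hi=0.0):
    """Uniform quantization of log-probabilities to 2**bits levels in [lo, hi]."""
    step = (hi - lo) / (2 ** bits - 1)
    return lo + np.round((np.clip(log_p, lo, hi) - lo) / step) * step

def nb_predict(log_pc, log_pxc, X):
    """argmax_c of log P(c) + sum_d log P(x_d | c) for discrete features X (N, D)."""
    N, D = X.shape
    scores = np.empty((N, log_pc.shape[0]))
    for c in range(log_pc.shape[0]):
        scores[:, c] = log_pc[c] + log_pxc[c][np.arange(D), X].sum(axis=1)
    return scores.argmax(axis=1)

# toy maximum-likelihood conditional probability tables and data
rng = np.random.default_rng(0)
C, D, V, N = 3, 8, 4, 1000
counts = rng.integers(1, 50, size=(C, D, V)).astype(float)
log_pxc = np.log(counts / counts.sum(axis=2, keepdims=True))
log_pc = np.full(C, -np.log(C))
X = rng.integers(0, V, size=(N, D))
reference = nb_predict(log_pc, log_pxc, X)

for bits in (2, 4, 6, 8):
    pred = nb_predict(quantize_log_probs(log_pc, bits), quantize_log_probs(log_pxc, bits), X)
    print(f"{bits}-bit tables: {np.mean(pred == reference):.3f} agreement with full precision")
```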
    • "objective and the maximum margin (MM) [11] [1] "
    ABSTRACT: Bayesian network classifiers are probabilistic classifiers achieving good classification rates in various applications. These classifiers consist of a directed acyclic graph and a set of conditional probability densities, which in the case of discrete-valued nodes can be represented by conditional probability tables. In this paper, we investigate the effect of quantizing these conditional probabilities. We derive worst-case and best-case bounds on the classification rate using interval arithmetic. Furthermore, we determine performance bounds that hold with a user-specified confidence using quantization theory. Our results emphasize that only small bit-widths are necessary to achieve good classification rates.
    IEEE International Conference on Acoustics, Speech and Signal Processing; 10/2013
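The interval-arithmetic idea in the abstract above can be illustrated with a simple sufficient condition: if every quantized table entry deviates from its full-precision value by at most eps, then each class score (a sum of D+1 table entries in a naive Bayes model) deviates by at most (D+1)*eps, so a sample's decision provably cannot change whenever its log-score margin exceeds 2*(D+1)*eps. The uniform quantizer, the naive Bayes structure, and the assumption that all table entries lie inside the quantizer range are choices made for illustration, not the paper's exact setting.

```python
import numpy as np

def guaranteed_stable_fraction(log_pc, log_pxc, X, bits, lo=-10.0, hi=0.0):
    """Fraction of samples whose decision provably survives bits-wide quantization.
    Assumes every table entry lies in [lo, hi], so a uniform quantizer perturbs
    it by at most eps = step / 2; each class score is a sum of D + 1 entries."""
    eps = 0.5 * (hi - lo) / (2 ** bits - 1)
    N, D = X.shape
    scores = np.empty((N, log_pc.shape[0]))
    for c in range(log_pc.shape[0]):
        scores[:, c] = log_pc[c] + log_pxc[c][np.arange(D), X].sum(axis=1)
    top2 = np.sort(scores, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]               # best minus runner-up score
    return np.mean(margin > 2 * (D + 1) * eps)     # sufficient, worst-case condition

# reuse the toy naive Bayes tables from the previous sketch
rng = np.random.default_rng(0)
C, D, V, N = 3, 8, 4, 1000
counts = rng.integers(1, 50, size=(C, D, V)).astype(float)
log_pxc = np.log(counts / counts.sum(axis=2, keepdims=True))
log_pc = np.full(C, -np.log(C))
X = rng.integers(0, V, size=(N, D))
for bits in (4, 6, 8):
    print(f"{bits} bits: {guaranteed_stable_fraction(log_pc, log_pxc, X, bits):.2%} "
          "of decisions provably unchanged")
```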