Conference Paper

# Probing the Space of Optimal Markov Logic Networks for Sequence Labeling

Authors:
To read the full-text of this research, you can request a copy directly from the authors.

## Abstract

Discovering relational structure between input features in sequence labeling models has shown to improve their accuracies in several problem settings. The problem of learning relational structure for sequence labeling can be posed as learning Markov Logic Networks (MLN) for sequence labeling, which we abbreviate as Markov Logic Chains (MLC). This objective in propositional space can be solved efficiently and optimally by a Hierarchical Kernels based approach, referred to as StructRELHKL, which we recently proposed. However, the applicability of StructRELHKL in complex first order settings is non-trivial and challenging. We present the challenges and possibilities for optimally and simultaneously learning the structure as well as parameters of MLCs (as against learning them separately and/or greedily). Here, we look into leveraging the StructRELHKL approach for optimizing the MLC learning steps to the extent possible. To this end, we categorize first order MLC features based on their complexity and show that complex features can be constructed from simpler ones. We define a self-contained class of features called absolute features ($$\mathcal{AF}$$), which can be conjoined to yield complex MLC features. Our approach first generates a set of relevant $$\mathcal{AF}$$s and then makes use of the algorithm for StructRELHKL to learn their optimal conjunctions. We demonstrate the efficiency of our approach by evaluating on a publicly available activity recognition dataset.

## No full-text available

... Our model parameter tuning for all the experiments were not performed in the best desired way, since the huge problem space limits the scalability of approaches tried. However, we have tuned the parameters in the best possible way, given the constraints and resources we had 18 . The results are summarised in Table 1. ...
... preparingDinner(T ) ← groceriesCupboard(T ) 18 For making our approaches scalable on such large data, we suggest, (i) parallelization using hadoop, (ii) using heuristics in our computations, and (iii) allowing larger epsilon values (amount by which the output from certain procedures can deviate from the optimum). We leave these for future work. ...
... This part of our work has appeared at ILP, 2012[18].5 This part of our work has appeared in ILP 2013[19].6 ψ(X, Y) stands for features representing the characteristics of input and output variables, and their relations. ...
Article
Full-text available
Discovering relational structure between input features in sequence labeling models has shown to improve their accuracy in several problem settings. However, the search space of relational features is exponential in the number of basic input features. Consequently, approaches that learn relational features, tend to follow a greedy search strategy. In this paper, we study the possibility of optimally learning and applying discriminative relational features for sequence labeling. For learning features derived from inputs at a particular sequence position, we propose a Hierarchical Kernels-based approach (referred to as Hierarchical Kernel Learning for Structured Output Spaces - StructHKL). This approach optimally and efficiently explores the hierarchical structure of the feature space for problems with structured output spaces such as sequence labeling. Since the StructHKL approach has limitations in learning complex relational features derived from inputs at relative positions, we propose two solutions to learn relational features namely, (i) enumerating simple component features of complex relational features and discovering their compositions using StructHKL and (ii) leveraging relational kernels, that compute the similarity between instances implicitly, in the sequence labeling problem. We perform extensive empirical evaluation on publicly available datasets and record our observations on settings in which certain approaches are effective.
Article
The Indian Institute of Technology (IIT) Bombay has a history of research and development in the area of databases, dating back to the early 1980s. D. B. Phatak and N. L. Sarda were among the first faculty members at IIT Bombay to work in the area of database systems. The number of PhD students increased from around 1 or 2 enrolled at a time in the early 1990s, to about 12 to 15 students at a time in recent years. While this number is much better than earlier, and is increasing rapidly, it is still small by most standards. However, the master's and bachelor's students have compensated for the shortage of PhD students, and have made very significant contributions to the research efforts, with well over three fourths of the papers having such students as coauthors. Graph data models are ubiquitous in semistructured search. Modeling a data graph as an electrical network, or equivalently, as a Markovian 'random surfer' process, is widely used in applications that need to characterize some notion of graph proximity.
Article
Full-text available
This paper describes a prototype system for matching data provided by a wireless network of autonomous reed switch devices with activities of daily living in a home environment.
Article
Full-text available
Many real world sequences such as protein secondary structures or shell logs exhibit a rich internal structures. Traditional probabilistic models of sequences, however, consider sequences of flat symbols only. Logical hidden Markov models have been proposed as one solution. They deal with logical sequences, i.e., sequences over an alphabet of logical atoms. This comes at the expense of a more complex model selection problem. Indeed, different abstraction levels have to be explored. In this paper, we propose a novel method for selecting logical hidden Markov models from data called SAGEM. SAGEM combines generalized expectation maximization, which optimizes parameters, with structure search for model selection using inductive logic programming refinement operators. We provide convergence and experimental results that show SAGEM's effectiveness.
Conference Paper
Full-text available
A sensor system capable of automatically recognizing activities would allow many potential ubiquitous applications. In this paper, we present an easy to install sensor network and an accurate but inexpensive annotation method. A recorded dataset consisting of 28 days of sensor data and its annotation is described and made available to the community. Through a number of experiments we show how the hidden Markov model and conditional random fields perform in recognizing activities. We achieve a timeslice accuracy of 95.6% and a class accuracy of 79.4%.
Conference Paper
Full-text available
. In this paper we present 1BC, a first-order Bayesian Classifier. Our approach is to view individuals as structured terms, and to distinguish between structural predicates referring to subterms (e.g. atoms from molecules), and properties applying to one or several of these subterms (e.g. a bond between two atoms). We describe an individual in terms of elementary features consisting of zero or more structural predicates and one property; these features are considered conditionally independent following the usual naive Bayes assumption. 1BC has been implemented in the context of the first-order descriptive learner Tertius, and we describe several experiments demonstrating the viability of our approach. 1 Introduction In this paper we present 1BC, a first-order Bayesian Classifier. While the propositional Bayesian Classifier makes the naive Bayes assumption of statistical independence of elementary features (one attribute taking on a particular value) given the class value, it is not i...
Conference Paper
Full-text available
One of the current challenges in artificial intelligence is modeling dynamic environments that change due to the actions or activities undertaken by people or agents. The task of inferring hidden states, e.g. the activities or intentions of people, based on observations is called filtering. Standard probabilistic models such as Dynamic Bayesian Networks are able to solve this task efficiently using approximative methods such as particle filters. However, these models do not support logical or relational representations. The key contribution of this paper is the upgrade of a particle filter algorithm for use with a probabilistic logical representation through the definition of a proposal distribution. The performance of the algorithm depends largely on how well this distribution fits the target distribution. We adopt the idea of logical compilation into Binary Decision Diagrams for sampling. This allows us to use the optimal proposal distribution which is normally prohibitively slow.
Conference Paper
Full-text available
This paper addresses the problem of Rule Ensemble Learning (REL), where the goal is simultaneous discovery of a small set of simple rules and their optimal weights that lead to good generalization. Rules are assumed to be conjunctions of basic propositions concerning the values taken by the input features. From the perspectives of interpretability as well as generalization, it is highly desirable to construct rule ensembles with low training error, having rules that are i) simple, i.e., involve few conjunctions and ii) few in number. We propose to explore the (exponentially) large feature space of all possible conjunctions optimally and efficiently by employing the recently introduced Hierarchical Kernel Learning (HKL) framework. The regularizer employed in the HKL formulation can be interpreted as a potential for discouraging selection of rules involving large number of conjunctions - justifying its suitability for constructing rule ensembles. Simulation results show that, in case of many benchmark datasets, the proposed approach improves over state-of-the-art REL algorithms in terms of generalization and indeed learns simple rules. Unfortunately, HKL selects a conjunction only if all its subsets are selected. We propose a novel convex formulation which alleviates this problem and generalizes the HKL framework. The main technical contribution of this paper is an efficient mirrordescent based active set algorithm for solving the new formulation. Empirical evaluations on REL problems illustrate the utility of generalized HKL.
Conference Paper
Full-text available
Conditional Random Fields (CRFs) provide a powerful instrument for labeling sequences. So far, however, CRFs have only been considered for labeling sequences over flat alphabets. In this paper, we describe TildeCRF, the first method for training CRFs on logical sequences, i.e., sequences over an alphabet of logical atoms. TildeCRF's key idea is to use relational regression trees in Dietterich et al.'s gradient tree boosting approach. Thus, the CRF potential functions are represented as weighted sums of relational regression trees. Experiments show a significant improvement over established results achieved with hidden Markov models and Fisher kernels for logical sequences.
Conference Paper
Full-text available
Most of the existing weight-learning algorithms for Markov Logic Networks (MLNs) use batch training which becomes computationally expensive and even infeasible for very large datasets since the training examples may not fit in main memory. To overcome this problem, previous work has used online learning algorithms to learn weights for MLNs. However, this prior work has only applied existing online algorithms, and there is no comprehensive study of online weight learning for MLNs. In this paper, we derive new online algorithms for structured prediction using the primaldual framework, apply them to learn weights for MLNs, and compare against existing online algorithms on two large, real-world datasets. The experimental results show that the new algorithms achieve better accuracy than existing methods.
Conference Paper
Full-text available
Many real-world applications of AI require both probability and first-order logic to deal with uncertainty and structural complexity. Logical AI has focused mainly on handling complexity, and statistical AI on handling uncertainty. Markov Logic Networks (MLNs) are a powerful representation that combine Markov Networks (MNs) and first-order logic by attaching weights to first-order formulas and viewing these as templates for features of MNs. State-of-the-art structure learning algorithms of MLNs maximize the likelihood of a relational database by performing a greedy search in the space of candidates. This can lead to suboptimal results because of the incapability of these approaches to escape local optima. Moreover, due to the combinatorially explosive space of potential candidates these methods are computationally prohibitive. We propose a novel algorithm for learning MLNs structure, based on the Iterated Local Search (ILS) metaheuristic that explores the space of structures through a biased sampling of the set of local optima. The algorithm focuses the search not on the full space of solutions but on a smaller subspace defined by the solutions that are locally optimal for the optimization engine. We show through experiments in two real-world domains that the proposed approach improves accuracy and learning time over the existing state-of-the-art algorithms.
Conference Paper
Full-text available
Recent years have seen a surge of interest in Statistical Relational Learning (SRL) models that combine logic with probabilities. One prominent example is Markov Logic Networks (MLNs). While MLNs are indeed highly expressive, this expressiveness comes at a cost. Learning MLNs is a hard problem and therefore has attracted much interest in the SRL community. Current methods for learning MLNs follow a two-step approach: first, perform a search through the space of possible clauses and then learn appropriate weights for these clauses. We propose to take a different approach, namely to learn both the weights and the structure of the MLN simultaneously. Our approach is based on functional gradient boosting where the problem of learning MLNs is turned into a series of relational functional approximation problems. We use two kinds of representations for the gradients: clause-based and tree-based. Our experimental evaluation on several benchmark data sets demonstrates that our new approach can learn MLNs as good or better than those found with state-of-the-art methods, but often in a fraction of the time.
Conference Paper
Full-text available
Most existing learning methods for Markov Logic Networks (MLNs) use batch training, which becomes computationally expensive and eventually infeasible for large datasets with thousands of training examples which may not even all fit in main memory. To address this issue, previous work has used online learning to train MLNs. However, they all assume that the model’s structure (set of logical clauses) is given, and only learn the model’s parameters. However, the input structure is usually incomplete, so it should also be updated. In this work, we present OSL—the first algorithm that performs both online structure and parameter learning for MLNs. Experimental results on two real-world datasets for natural-language field segmentation show that OSL outperforms systems that cannot revise structure.
Article
Full-text available
Logical hidden Markov models (LOHMMs) upgrade traditional hidden Markov models to deal with sequences of structured symbols in the form of logical atoms, rather than flat characters. This note formally introduces LOHMMs and presents solutions to the three central inference problems for LOHMMs: evaluation, most likely hidden state sequence and parameter estimation. The resulting representation and algorithms are experimentally evaluated on problems from the domain of bioinformatics.
Article
Full-text available
Discovery of frequent patterns has been studied in a variety of data mining settings. In its simplest form, known from association rule mining, the task is to discover all frequent itemsets, i.e., all combinations of items that are found in a sufficient number of examples. The fundamental task of association rule and frequent set discovery has been extended in various directions, allowing more useful patterns to be discovered with special purpose algorithms. We present WARMR, a general purpose inductive logic programming algorithm that addresses frequent query discovery: a very general DATALOG formulation of the frequent pattern discovery problem. The motivation for this novel approach is twofold. First, exploratory data mining is well supported: WARMR offers the flexibility required to experiment with standard and in particular novel settings not supported by special purpose algorithms. Also, application prototypes based on WARMR can be used as benchmarks in the comparison and evaluation of new special purpose algorithms. Second, the unified representation gives insight to the blurred picture of the frequent pattern discovery domain. Within the DATALOG formulation a number of dimensions appear that relink diverged settings. We demonstrate the frequent query approach and its use on two applications, one in alarm analysis, and one in a chemical toxicology domain.
Article
Full-text available
One of the goals of artificial intelligence is to develop agents that learn and act in complex environments. Realistic environments typically feature a variable number of objects, relations amongst them, and non-deterministic transition behavior. While standard probabilistic sequence models provide efficient inference and learning techniques for sequential data, they typically cannot fully capture the relational complexity. On the other hand, statistical relational learning techniques are often too inefficient to cope with complex sequential data. In this paper, we introduce a simple model that occupies an intermediate position in this expressiveness/efficiency trade-off. It is based on CP-logic (Causal Probabilistic Logic), an expressive probabilistic logic for modeling causality. However, by specializing CP-logic to represent a probability distribution over sequences of relational state descriptions and employing a Markov assumption, inference and learning become more tractable and effective. Specifically, we show how to solve part of the inference and learning problems directly at the first-order level, while transforming the remaining part into the problem of computing all satisfying assignments for a Boolean formula in a binary decision diagram. We experimentally validate that the resulting technique is able to handle probabilistic relational domains with a substantial number of objects and relations.
Article
Full-text available
Inductive Logic Programming (ILP) is an area of Machine Learning which has now reached its twentieth year. Using the analogy of a human biography this paper recalls the development of the subject from its infancy through childhood and teenage years. We show how in each phase ILP has been characterised by an attempt to extend theory and implementations in tandem with the development of novel and challenging real-world applications. Lastly, by projection we suggest directions for research which will help the subject coming of age. KeywordsInductive Logic Programming–(Statistical) relational learning–Structured data in Machine Learning
Article
Full-text available
Learning general functional dependencies is one of the main goals in machine learning. Recent progress in kernel-based...
Article
Full-text available
We consider the problem of high-dimensional non-linear variable selection for supervised learning. Our approach is based on performing linear selection among exponentially many appropriately defined positive definite kernels that characterize non-linear interactions between the original variables. To select efficiently from these many kernels, we use the natural hierarchical structure of the problem to extend the multiple kernel learning framework to kernels that can be embedded in a directed acyclic graph; we show that it is then possible to perform kernel selection through a graph-adapted sparsity-inducing norm, in polynomial time in the number of selected kernels. Moreover, we study the consistency of variable selection in high-dimensional settings, showing that under certain assumptions, our regularization framework allows a number of irrelevant variables which is exponential in the number of observations. Our simulations on synthetic datasets and datasets from the UCI repository show state-of-the-art predictive performance for non-linear regression problems.
Book
Propositional logic.- First-order logic.- Normal forms and Herbrand models.- Resolution.- Subsumption theorem and refutation completeness.- Linear and input resolution.- SLD-resolution.- SLDNF-resolution.- What is inductive logic programming?.- The framework for model inference.- Inverse resolution.- Unfolding.- The lattice and cover structure of atoms.- The subsumption order.- The implication order.- Background knowledge.- Refinement operators.- PAC learning.- Further topics.
Article
Conditional Random Fields (CRFs) are undirected graphical models, a special case of which correspond to conditionally-trained finite state machines. A key advantage of these models is their great flexibility to include a wide array of overlapping, multi-granularity, non-independent features of the input. In face of this freedom, an important question that remains is, what features should be used? This paper presents a feature induction method for CRFs. Founded on the principle of constructing only those feature conjunctions that significantly increase log-likelihood, the approach is based on that of Della Pietra et al [1997], but altered to work with conditional rather than joint probabilities, and with additional modifications for providing tractability specifically for a sequence model. In comparison with traditional approaches, automated feature induction offers both improved accuracy and more than an order of magnitude reduction in feature count; it enables the use of richer, higher-order Markov models, and offers more freedom to liberally guess about which atomic features may be relevant to a task. The induction method applies to linear-chain CRFs, as well as to more arbitrary CRF structures, also known as Relational Markov Networks [Taskar & Koller, 2002]. We present experimental results on a named entity extraction task.
Conference Paper
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.
Conference Paper
Intelligent agents must be able to handle the complexity and uncertainty of the real world. Logical AI has focused mainly on the former, and statistical AI on the latter. Markov logic combines the two by attaching weights to rst-order formu- las and viewing them as templates for features of Markov networks. Inference algorithms for Markov logic draw on ideas from satisability , Markov chain Monte Carlo and knowledge-based model construction. Learning algorithms are based on the voted perceptron, pseudo-likelihood and in- ductive logic programming. Markov logic has been success- fully applied to problems in entity resolution, link predic- tion, information extraction and others, and is the basis of the open-source Alchemy system.
Conference Paper
Statistical machine learning is in the midst of a “relational revolution”. After many decades of focusing on independent and identically-distributed (iid) examples, many researchers are now studying problems in which the examples are linked together into complex networks. These networks ca be a simple as sequences and 2-D meshes (such as those arising in part-of-speech tagging and remote sensing) or as complex as citation graphs, the world wide web, and relational data bases. Statistical relational learning raises many new challenges and opportunities. Because the statistical model depends on the domain’s relational structure, parameters in the model are often tied. This has advantages for making parameter estimation feasible, but complicates the model search. Because the “features” involve relationships among multiple objects, there is often a need to intelligently construct aggregates and other relational features. Problems that arise from linkage and autocorrelation among objects must be taken into account. Because instances are linked together, classification typically involves complex inference to arrive at “collective classification” in which the labels predicted for the test instances are determined jointly rather than individually. Unlike iid problems, where the result of learning is a single classifier, relational learning often involves instances that are heterogeneous, where the result of learning is a set of multiple components (classifiers, probability distributions, etc.) that predict labels of objects and logical relationships between objects.
Conference Paper
This paper describes the design of the inductive logic programming system Lime. Instead of employing a greedy covering approach to constructing clauses, Lime employs a Bayesian heuristic to evaluate logic programs as hypotheses. The notion of a simple clause is introduced. These sets of literals may be viewed as subparts of clauses that are efiectively independent in terms of variables used. Instead of growing a clause one literal at a time, Lime efficiently combines simple clauses to construct a set of gainful candidate clauses. Subsets of these candidate clauses are evaluated via the Bayesian heuristic to find the final hypothesis. Details of the algorithms and data structures of Lime are discussed. Lime’s handling of recursive logic programs is also described. Experimental results to illustrate how Lime achieves its design goals of better noise handling, learning from fixed set of examples (and from only positive data), and of learning recursive logic programs are provided. Experimental results comparing Lime with FOIL and PROGOL in the KRK domain in the presence of noise are presented. It is also shown that the already good noise handling performance of Lime further improves when learning recursive definitions in the presence of noise.
Conference Paper
Markov logic networks (MLNs) combine logic and probability by attaching weights to rst-order clauses, and viewing these as templates for features of Markov networks. In this paper we develop an algorithm for learning the structure of MLNs from relational databases, combining ideas from inductive logic pro- gramming (ILP) and feature induction in Markov net- works. The algorithm performs a beam or shortest- rst search of the space of clauses, guided by a weighted pseudo-likelihood measure. This requires computing the optimal weights for each candidate structure, but we show how this can be done ef- ciently. The algorithm can be used to learn an MLN from scratch, or to rene an existing knowledge base. We have applied it in two real-world domains, and found that it outperforms using off-the-shelf ILP sys- tems to learn the MLN structure, as well as pure ILP, purely probabilistic and purely knowledge-based ap- proaches.
Conference Paper
Hidden Markov Models (HMMs) are widely used in activity recognition. Ideally, the current activity should be determined using the vector of all sensor readings; however, this results in an exponentially large space of observations. The current fix to this problem is to assume conditional independence between individual sensors, given an activity, and factorizing the emission distribution in a naive way. In several cases, this leads to accuracy loss. We present an intermediate solution, viz., determining a mapping between each activity and conjunctions over a relevant subset of dependent sensors. The approach discovers features that are conjunctions of sensors and maps them to activities. This does away the assumption of naive factorization while not ruling out the possibility of the vector of all the sensor readings being relevant to activities. We demonstrate through experimental evaluation that our approach prunes potentially irrelevant subsets of sensor readings and results in significant accuracy improvements.
Article
Automated planning requires action models described using languages such as the Planning Domain Definition Language (PDDL) as input, but building action models from scratch is a very difficult and time-consuming task, even for experts. This is because it is difficult to formally describe all conditions and changes, reflected in the preconditions and effects of action models. In the past, there have been algorithms that can automatically learn simple action models from plan traces. However, there are many cases in the real world where we need more complicated expressions based on universal and existential quantifiers, as well as logical implications in action models to precisely describe the underlying mechanisms of the actions. Such complex action models cannot be learned using many previous algorithms. In this article, we present a novel algorithm called LAMP (Learning Action Models from Plan traces), to learn action models with quantifiers and logical implications from a set of observed plan traces with only partially observed intermediate state information. The LAMP algorithm generates candidate formulas that are passed to a Markov Logic Network (MLN) for selecting the most likely subsets of candidate formulas. The selected subset of formulas is then transformed into learned action models, which can then be tweaked by domain experts to arrive at the final models. We evaluate our approach in four planning domains to demonstrate that LAMP is effective in learning complex action models. We also analyze the human effort saved by using LAMP in helping to create action models through a user study. Finally, we apply LAMP to a real-world application domain for software requirement engineering to help the engineers acquire software requirements and show that LAMP can indeed help experts a great deal in real-world knowledge-engineering applications.
Article
We propose a simple approach to combining first-order logic and probabilistic graphical models in a single representation. A Markov logic network (MLN) is a first-order knowledge base with a weight attached to each formula (or clause). Together with a set of constants representing objects in the domain, it specifies a ground Markov network containing one feature for each possible grounding of a first-order formula in the KB, with the corresponding weight. Inference in MLNs is performed by MCMC over the minimal subset of the ground network required for answering the query. Weights are efficiently learned from relational databases by iteratively optimizing a pseudo-likelihood measure. Optionally, additional clauses are learned using inductive logic programming techniques. Experiments with a real-world database and knowledge base in a university domain illustrate the promise of this approach.
Article
The Viterbi algorithm (VA) is a recursive optimal solution to the problem of estimating the state sequence of a discrete-time finite-state Markov process observed in memoryless noise. Many problems in areas such as digital communications can be cast in this form. This paper gives a tutorial exposition of the algorithm and of how it is implemented and analyzed. Applications to date are reviewed. Increasing use of the algorithm in a widening variety of areas is foreseen.
Article
This paper presents a feature induction method for CRFs. Founded on the principle of constructing only those feature conjunctions that significantly increase loglikelihood, the approach builds on that of Della Pietra et al (1997), but is altered to work with conditional rather than joint probabilities, and with a mean-field approximation and other additional modifications that improve efficiency specifically for a sequence model. In comparison with traditional approaches, automated feature induction offers both improved accuracy and significant reduction in feature count; it enables the use of richer, higherorder Markov models, and offers more freedom to liberally guess about which atomic features may be relevant to a task
Article
We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models. Conditional random fields also avoid a fundamental limitation of maximum entropy Markov models (MEMMs) and other discriminative Markov models based on directed graphical models, which can be biased towards states with few successor states. We present iterative parameter estimation algorithms for conditional random fields and compare the performance of the resulting models to HMMs and MEMMs on synthetic and natural-language data.