# Machine Learning: ECML-97: 9th European Conference on Machine Learning Prague, Czech Republic, April 23–25, 1997 Proceedings

## Abstract

This book constitutes the refereed proceedings of the Ninth European Conference on Machine Learning, ECML-97, held in Prague, Czech Republic, in April 1997.
This volume presents 26 revised full papers selected from a total of 73 submissions. Also included are an abstract and two papers corresponding to the invited talks as well as descriptions from four satellite workshops. The volume covers the whole spectrum of current machine learning issues.

## Chapters (25)

Traditional wisdom has it that the better a theory compresses the learning data concerning some phenomenon under investigation, the better we learn, generalize, and the better the theory predicts unknown data. This belief is vindicated in practice but apparently has not been rigorously proved in a general setting. Making these ideas rigorous involves the length of the shortest effective description of an individual object: its Kolmogorov complexity. In a previous paper we have shown that optimal compression is almost always a best strategy in hypotheses identification (an ideal form of the minimum description length (MDL) principle). Whereas the single best hypothesis does not necessarily give the best prediction, we demonstrate that nonetheless compression is almost always the best strategy in prediction methods in the style of R. Solomonoff.

The aim of relational learning is to develop methods for the induction of descriptions in representation formalisms that are more expressive than attribute-value representation. Feature terms have been studied to formalize object-centered representation in declarative languages and can be seen as a subset of first-order logic. We present a representation formalism based on feature terms and we show how induction can be performed in a natural way using a notion of subsumption based on an informational ordering. Moreover, feature terms also allow incomplete information to be specified in a natural way. An example of such inductive methods, indie, is presented. indie performs bottom-up heuristic search on the subsumption lattice of the feature term space. Results of this method on several domains are explained.

One of the most interesting problems faced by Artificial Intelligence researchers is to reproduce a capability typical of living beings: that of learning to perform motor tasks, a problem known as skill acquisition. This is a very difficult goal because the overall behavior of an agent is the result of quite a complex activity, involving sensory, planning and motor processing. In this paper, I present a novel approach for acquiring new skills, named Soft Teaching, that is characterized by a learning-by-experience process in which an agent exploits a symbolic, qualitative description of the task to perform that cannot, however, be used directly for control purposes. A specific Soft Teaching technique, named Symmetries, was implemented and tested against a continuous-domain version of the well-known pole-balancing problem.

- Pawel Cichosz

Reinforcement learning systems learn to act in an uncertain environment by executing actions and observing their long-term effects. A large number of time steps may be required before this trial-and-error process converges to a satisfactory policy. It is highly desirable that the number of experiences needed by the system to learn to perform its task be minimized, particularly if making errors is costly. One approach to achieving this goal is to use hypothetical experiences, which requires some additional computation, but may reduce the necessary number of much more costly real experiences. This well-known idea of augmenting reinforcement learning by planning is revisited in this paper in the context of truncated TD(lambda), or TTD, a simple computational technique which allows reinforcement learning algorithms based on the methods of temporal differences to learn considerably faster with essentially no additional computational expense. Two different ways of combining TTD with planning are proposed which make it possible to benefit from lambda > 0 in both the learning and planning processes. The algorithms are evaluated experimentally on a family of grid path-finding tasks and shown to indeed yield a considerable reduction of the number of real interactions with the environment necessary to converge, as well as an improvement of scaling properties.
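As an illustration of the kind of quantity a truncated TD(lambda) method works with, the following is a minimal sketch of a generic truncated lambda-return computed backwards over a fixed window of experience. It is not taken from the paper: the function name, the argument layout and the choice to bootstrap from the last value estimate in the window are assumptions made for this example.

```python
def truncated_lambda_return(rewards, next_values, gamma, lam):
    """Truncated lambda-return for the first step of a window (illustrative).

    rewards[k]     : reward received at step t+k
    next_values[k] : current value estimate of the state reached at step t+k+1
    gamma, lam     : discount factor and lambda, both in [0, 1]
    """
    if not rewards or len(rewards) != len(next_values):
        raise ValueError("rewards and next_values must be non-empty and of equal length")
    # Beyond the end of the window, fall back on the last value estimate.
    g = next_values[-1]
    # Work backwards: each step mixes the one-step target with the longer return.
    for r, v in zip(reversed(rewards), reversed(next_values)):
        g = r + gamma * ((1.0 - lam) * v + lam * g)
    return g
```

With `lam = 0` this reduces to the one-step TD target, and with `lam = 1` to the discounted sum of rewards over the window plus the discounted value estimate of the final state.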

Structural matching, originally introduced by Steven Vere, implements and formalizes the notion of a most specific generalisation of two productions, possibly in the presence of a background theory. Despite various studies in the mid-seventies and early eighties, several problems remained. These include the use of background knowledge, the nonuniqueness of most specific generalisations, and the handling of inequalities. We show how Gordon Plotkin's notions of least general generalisation and relative least general generalisation defined on clauses can be adapted for use in structural matching such that the remaining problems disappear. Defining clauses as universally quantified disjunctions of literals and productions as existentially quantified conjunctions of literals, it is shown that the lattice on clauses imposed by θ-subsumption is order-isomorphic to the lattice on productions needed for structural matching.
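To make the notion of a least general generalisation concrete, here is a small sketch of anti-unification on two first-order terms, the building block underlying Plotkin's construction. It covers terms only, not clauses or background knowledge, and the term encoding (tuples for compound terms, strings for constants) and the generated variable names `X0`, `X1`, ... are assumptions of this example.

```python
def lgg(s, t, table=None):
    """Least general generalisation (anti-unification) of two terms.

    A compound term is a tuple (functor, arg1, ..., argN); a constant is a string.
    Generated variables are strings "X0", "X1", ... (assumed not to clash
    with any constant name).
    """
    if table is None:
        table = {}
    if s == t:
        return s
    if (isinstance(s, tuple) and isinstance(t, tuple)
            and s[0] == t[0] and len(s) == len(t)):
        # Same functor and arity: generalise argument by argument.
        return (s[0],) + tuple(lgg(a, b, table) for a, b in zip(s[1:], t[1:]))
    # Differing subterms: the same pair is always replaced by the same variable.
    if (s, t) not in table:
        table[(s, t)] = "X%d" % len(table)
    return table[(s, t)]
```

For example, generalising `f(a, b, a)` and `f(c, d, c)` yields `f(X0, X1, X0)`: the repeated pair `(a, c)` is mapped to a single variable, which is what keeps the result least general.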

A new classification algorithm called VFI (for Voting Feature Intervals) is proposed. A concept is represented by a set of feature intervals on each feature dimension separately. Each feature participates in the classification by distributing real-valued votes among classes. The class receiving the highest vote is declared to be the predicted class. VFI is compared with the Naive Bayesian Classifier (NBC), which also considers each feature separately. Experiments on real-world datasets show that VFI achieves classification accuracy comparable to, and even better than, that of NBC. Moreover, VFI is faster than NBC on all datasets.
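The voting scheme described above can be sketched roughly as follows. This is a simplified illustration rather than the published algorithm: equal-width bins stand in for the feature intervals, and the function names, the `n_bins` parameter and the normalisation details are assumptions of this example.

```python
from collections import Counter

def _bin(value, lo, width, n_bins):
    # Clamp so that unseen values outside the training range still get a bin.
    return min(max(int((value - lo) / width), 0), n_bins - 1)

def train_vfi(X, y, n_bins=3):
    """Build per-feature interval votes (equal-width bins are an assumption)."""
    classes = sorted(set(y))
    class_size = Counter(y)
    model = []
    for f in range(len(X[0])):
        col = [row[f] for row in X]
        lo, hi = min(col), max(col)
        width = (hi - lo) / n_bins or 1.0
        counts = [Counter() for _ in range(n_bins)]
        for value, label in zip(col, y):
            counts[_bin(value, lo, width, n_bins)][label] += 1
        votes = []
        for c in counts:
            # Normalise by class size, then rescale so that each non-empty
            # interval distributes one vote in total among the classes.
            raw = {k: c[k] / class_size[k] for k in classes}
            total = sum(raw.values())
            votes.append({k: (raw[k] / total if total else 0.0) for k in classes})
        model.append((lo, width, votes))
    return classes, model

def predict_vfi(trained, row):
    """Sum the votes of every feature and return the class with the most votes."""
    classes, model = trained
    totals = dict.fromkeys(classes, 0.0)
    for value, (lo, width, votes) in zip(row, model):
        for k, v in votes[_bin(value, lo, width, len(votes))].items():
            totals[k] += v
    return max(classes, key=lambda k: totals[k])
```

Each feature is handled independently, which is the property the abstract highlights when comparing VFI with the Naive Bayesian Classifier.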

In learning from examples it is often useful to expand an attribute-vector representation by intermediate concepts. The usual advantage of such structuring of the learning problem is that it makes the learning easier and improves the comprehensibility of induced descriptions. In this paper, we develop a technique for discovering useful intermediate concepts when both the class and the attributes are real-valued. The technique is based on a decomposition method originally developed for the design of switching circuits and recently extended to handle incompletely specified multi-valued functions. It was also applied to machine learning tasks. In this paper, we introduce the modifications needed to decompose real functions and to present them in symbolic form. The method is evaluated on a number of test functions. The results show that the method correctly decomposes fairly complex functions. The decomposition hierarchy does not depend on a given repertoire of basic functions (background knowledge).

The Occam's razor principle suggests that among all the correct hypotheses, the simplest hypothesis is the one which best captures the structure of the problem domain and has the highest prediction accuracy when classifying new instances. This principle is implicitly used also for dealing with noise, in order to avoid overfitting a noisy training set by rule truncation or by pruning of decision trees. This work gives a theoretical framework for the applicability of Occam's razor, developed into a procedure for eliminating noise from a training set. The results of empirical evaluation show the usefulness of the presented approach to noise elimination.

Most of the current constructive induction algorithms degrade in performance as the target concept becomes larger and more complex in terms of Boolean combinations. Most are only capable of constructing relatively small new attributes. Though it is impossible to build a learner that can learn any arbitrarily large and complex concept, there are some large and complex concepts that can be represented by a simple relation, such as prototypical concepts, e.g., m-of-n, majority, etc. In this paper, we propose a new approach that combines a neural net and iterative attribute construction to learn relatively short but complex Boolean combinations and prototypical structures. We also carried out a series of systematic experiments to characterize our approach.

In the subject of machine learning, a concept is a description of a cluster of the concept's instances. In order to invent a new concept, one has to discover such a cluster. The necessary tool for clustering is a metric, or pseudo-metric. We present families of pseudo-metrics which seem well suited to such tasks. On terms and literals, we construct a new kind of metric from the substitutions which arise through subsumption. From these, it is easy to form metrics on clauses, by a technique due to F. Hausdorff. They will be applicable to generalization from sets of ground clauses, to discovery of heuristic guidance for theorem proving, and to inductive logic programming.
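The Hausdorff technique mentioned above lifts a distance between elements to a distance between finite sets of elements, which is how a metric on literals can induce one on clauses. A minimal sketch, assuming the base distance `d` is supplied by the caller (the function name and the numeric toy data are illustrative, not from the paper):

```python
def hausdorff(A, B, d):
    """Hausdorff distance between two non-empty finite sets under base distance d."""
    if not A or not B:
        raise ValueError("both sets must be non-empty")

    def directed(P, Q):
        # Largest distance from a point of P to its nearest point in Q.
        return max(min(d(p, q) for q in Q) for p in P)

    return max(directed(A, B), directed(B, A))


# Toy usage with numbers standing in for literals.
example = hausdorff({0, 1}, {0, 3}, lambda x, y: abs(x - y))
```

Taking the maximum of the two directed distances is what makes the result symmetric; when `d` is only a pseudo-metric, the lifted distance is a pseudo-metric as well.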

Confirmatory induction is based on the assumption that unknown individuals are similar to known ones, i.e. they satisfy the properties shared by known individuals. This assumption can be represented inside a non-monotonic logical framework. Accordingly, existing approaches to confirmatory induction take advantage of the machinery developed so far for non-monotonic inference. However, they are based on completion policies that are unnecessarily strong for the purpose of induction. The contribution of this paper is twofold. First, some basic requirements that any model for generalization based on confirmatory induction should satisfy are proposed. Then, a model for generalization based on Hempel's notion of confirmation is introduced. This model is rational in the sense that it satisfies the rationality postulates we exhibit; moreover, the completion principle on which this model is based captures exactly the similarity assumption, hence the model can be considered minimal as well.

In this paper, we present a system, called icc, that learns constrained logic programs containing function symbols. The particularity of our approach is to consider, as in the field of Constraint Logic Programming, a specific computation domain and to handle terms by taking into account their values in this domain. Nevertheless, an earlier version of our system was only able to learn constraints of the form X_i = t, where X_i is a variable and t is a term. We propose here a method for learning linear constraints. This problem has already been studied extensively in the field of Statistical Learning Theory and for learning Oblique Decision Trees. As far as we know, the originality of our approach is to rely on a Linear Programming solver. Moreover, integrating it in icc enables the system to learn nonlinear constraints.

This paper presents a reinforcement learning algorithm designed for solving optimal control problems for which the state space and the time are continuous variables. Like Dynamic Programming methods, reinforcement learning techniques generate an optimal feedback policy by means of the value function, which estimates the best expectation of cumulative reward as a function of the initial state. The algorithm proposed here uses finite-element methods for approximating this function. It is composed of two dynamics: the learning dynamics, called Finite-Element Reinforcement Learning, which estimates the values at the vertices of a triangulation defined upon the state space, and the structural dynamics, which refines the triangulation inside regions where the value function is irregular. This mesh refinement algorithm is intended to solve the problem of the combinatorial explosion of the number of values to be estimated. A formalism for reinforcement learning in the continuous case is proposed, the Hamilton-Jacobi-Bellman equation is stated, then the algorithm is presented and applied to a simple two-dimensional target problem.

This paper proposes an empirical study of inductive Genetic Programming with Decision Trees. An approach to development of fitness functions for efficient navigation of the search process is presented. It relies on analysis of the fitness landscape structure and suggests measuring its characteristics with statistical correlations. We demonstrate that this approach increases the global landscape correlation, and thus leads to mitigation of the search difficulties. Another claim is that the elaborated fitness functions help to produce decision trees with low syntactic complexity and high predictive accuracy.

Probabilistic Incremental Program Evolution (PIPE) is a novel technique for automatic program synthesis. We combine probability vector coding of program instructions [Schmidhuber, 1997], Population-Based Incremental Learning (PBIL) [Baluja and Caruana, 1995] and tree-coding of programs used in variants of Genetic Programming (GP) [Cramer, 1985; Koza, 1992]. PIPE uses a stochastic selection method for successively generating better and better programs according to an adaptive probabilistic prototype tree. No crossover operator is used. We compare PIPE to Koza's GP variant on a function regression problem and the 6-bit parity problem.
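One of the ingredients named above, Population-Based Incremental Learning, can be illustrated on plain bit strings. The sketch below is not PIPE itself, which maintains a probabilistic prototype tree over program instructions; it only shows the basic PBIL idea of shifting a probability vector towards the best sample of each generation. The OneMax fitness (number of ones), the parameter values and all function names are assumptions of this example.

```python
import random

def sample(p, rng):
    """Draw one bit string from the probability vector p."""
    return [1 if rng.random() < pi else 0 for pi in p]

def pbil_update(p, best, lr):
    """Move each probability a fraction lr of the way towards the best sample."""
    return [(1.0 - lr) * pi + lr * bi for pi, bi in zip(p, best)]

def pbil_onemax(n_bits=8, pop_size=20, generations=50, lr=0.1, seed=0):
    """Run PBIL on the OneMax toy problem and return the final probability vector."""
    rng = random.Random(seed)
    p = [0.5] * n_bits
    for _ in range(generations):
        population = [sample(p, rng) for _ in range(pop_size)]
        best = max(population, key=sum)  # fitness = number of ones
        p = pbil_update(p, best, lr)
    return p
```

As in the abstract's description of PIPE, no crossover is involved: new candidates come only from sampling the adaptive probability model.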

We present NeuroLinear, a system for extracting oblique decision rules from neural networks that have been trained for classification of patterns. Each condition of an oblique decision rule corresponds to a partition of the attribute space by a hyperplane that is not necessarily axis-parallel. Allowing a set of such hyperplanes to form the boundaries of the decision regions leads to a significant reduction in the number of rules generated while maintaining the accuracy rates of the networks. We describe the components of NeuroLinear in detail using a heart disease diagnosis problem. Our experimental results on real-world datasets show that the system is effective in extracting compact and comprehensible rules with high predictive accuracy from neural networks.

We are developing the GRG knowledge discovery system for learning decision rules from relational databases. The GRG system generalizes data, reduces the number of attributes, and generates decision rules. A subsystem of this software learns decision rules using familiar and novel rule induction techniques and uses these rules to make decisions. This paper provides an overview of GRG, describes those aspects of the system most relevant to creating and using decision rules, and compares it to other machine learning approaches.

We show that adaptive real time dynamic programming extended with the action selection strategy which chooses the best action according to the latest estimate of the cost function yields asymptotically optimal policies within finite time under the minimax optimality criterion. From this it follows that learning and exploitation do not conflict under this special optimality criterion. We relate this result to learning optimal strategies in repeated two-player zero-sum deterministic games.

We present in this paper the original notion of natural relation, a quasi order that extends the idea of generality order: it allows the sound and dynamic pruning of hypotheses that do not satisfy a property, be it completeness or correctness with respect to the training examples, or hypothesis language restriction.
Natural relations for conjunctions of such properties are characterized. Learning operators that satisfy these complex natural relations allow pruning with respect to this set of properties to take place before inappropriate hypotheses are generated.
Once the natural relation is defined that optimally prunes the search space with respect to a set of properties, we discuss the existence of ideal operators for the search space ordered by this natural relation. We have adapted the results from [vdLNC94a] on the non-existence of ideal operators to those complex natural relations. We prove those nonexistence conditions do not apply to some of those natural relations, thus overcoming the previous negative results about ideal operators for space ordered by θ-subsumption only.

Over the years, research in the field of the relationship between satisfaction and loyalty has been confronted with a number of conceptual, methodological, analytical as well as operational drawbacks. We introduce an analysis method, based on machine learning techniques. The method provides insight into the nature of the relationship between satisfaction and loyalty. In this article, building on previous research concerning brand and dealer loyalty, the relationship between satisfaction with the car, satisfaction with the dealer (sales and after-sales), brand loyalty and dealer loyalty (sales and after-sales) has been investigated. The method has been evaluated and the results are compared with the results of a frequently used method.

The dominant theme of case-based research at recent ML conferences has been on classifying cases represented by feature vectors. However, other useful tasks can be targeted, and other representations are often preferable. We review the recent literature on case-based learning, focusing on alternative performance tasks and more expressive case representations. We also highlight topics in need of additional research.

Language learning has thus far not been a hot application for machine-learning (ML) research. This limited attention for work on empirical learning of language knowledge and behaviour from text and speech data seems unjustified. After all, it is becoming apparent that empirical learning of Natural Language Processing (NLP) can alleviate NLP's all-time main problem, viz. the knowledge acquisition bottleneck: empirical ML methods such as rule induction, top down induction of decision trees, lazy learning, inductive logic programming, and some types of neural network learning, seem to be excellently suited to automatically induce exactly that knowledge that is hard to gather by hand. In this paper we address the question why NLP is an interesting application for empirical ML, and provide a brief overview of current work in this area.

Human-Agent Interaction as a specific area of Human-Computer Interaction is of primary importance for the development of systems that should cooperate with humans. The ability to learn, i.e., to adapt to preferences, abilities and behaviour of a user and to peculiarities of the task at hand, should provide for both a wider range of application and a higher degree of acceptance of agent technology. In this paper, we discuss the role of Machine Learning as a basic technology for human-agent interaction and motivate the need for interdisciplinary approaches to solve problems related to communication with artificial agents for task specification, teaching, or information retrieval purposes.

Dealing with dynamically changing domains is a very important topic in Machine Learning (ML) which has very interesting practical applications. Some attempts have already been made in both the statistical and machine learning communities to address some of the issues. In this paper we survey the state of the art in the available literature in this area. We argue that a lot of further research is still needed, outline the directions that such research should take and describe the expected results. We also argue that most of the problems in this area can be solved only through interaction between researchers of the statistical and ML communities.

Soil erosion is a major cause of damage to agricultural lands in many parts of the world and is of particular concern in semiarid parts of Iran. We use five machine learning techniques, namely Random Forest (RF), M5P, Reduced Error Pruning Tree (REPTree), Gaussian Processes (GP), and Pace Regression (PR), under two scenarios to predict soil erodibility in the Dehgolan region, Kurdistan Province, Iran. Our models are based on a variety of soil properties, including soil texture, structure, permeability, bulk density, aggregates, organic matter, and chemical constituents. We checked the validity of the models with statistical metrics, including the coefficient of determination (R²), mean absolute error (MAE), root mean squared error (RMSE), T-tests, Taylor diagrams, and box plots. All five algorithms show a positive correlation between the soil erodibility factor (K) and silt, sand, fine sand, bulk density, and infiltration. The GP model has the highest prediction accuracy (R² = 0.843, MAE = 0.0044, RMSE = 0.0050). It outperformed the RF (R² = 0.812, MAE = 0.0050, RMSE = 0.0061), PR (R² = 0.794, MAE = 0.0037, RMSE = 0.0052), M5P (R² = 0.781, MAE = 0.0043, RMSE = 0.0053), and REPTree (R² = 0.752, MAE = 0.0045, RMSE = 0.0056) algorithms and thus is a useful complement to studies aimed at predicting soil erodibility in areas with similar climate and soil characteristics.

Bayesian additive regression trees (BART) is a tree-based machine learning method that has been successfully applied to regression and classification problems. BART assumes regularisation priors on a set of trees that work as weak learners and is very flexible for predicting in the presence of nonlinearity and high-order interactions. In this paper, we introduce an extension of BART, called model trees BART (MOTR-BART), that considers piecewise linear functions at node levels instead of piecewise constants. In MOTR-BART, rather than having a unique value at node level for the prediction, a linear predictor is estimated considering the covariates that have been used as the split variables in the corresponding tree. In our approach, local linearities are captured more efficiently and fewer trees are required to achieve equal or better performance than BART. Via simulation studies and real data applications, we compare MOTR-BART to its main competitors. R code for MOTR-BART implementation is available at https://github.com/ebprado/MOTR-BART.

The objective of this study was to investigate the reduction of phosphorus from rice mill wastewater by using free floating aquatic plants. Four free floating aquatic plants were used for this study, namely water hyacinth, water lettuce, salvinia, and duckweed. The aquatic plants reduced the total phosphorus (TP) content by up to 80% and the chemical oxygen demand (COD) by up to 75% within 15 days. The maximum efficiency of TP and COD reduction was observed with water lettuce, followed by water hyacinth, duckweed, and salvinia. The study also aims to predict phosphorus removal with three modeling techniques, namely linear regression (LR), artificial neural network (ANN), and M5P. Prediction has been done considering hydraulic retention time (HRT), hydraulic loading rate (HLR), and initial concentration of phosphorus (Cin) as input variables, whereas the reduction rate of TP (R) has been considered as the predicted variable. ANN shows promising results compared to M5P tree and LR modeling. The model accuracy is analyzed using three statistical evaluation parameters: coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE).
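The three evaluation parameters named above can be computed as in the following sketch. It assumes R² is defined as one minus the ratio of residual to total sum of squares; some studies instead report the squared Pearson correlation, and the abstract does not say which definition was used. The function name and the returned dictionary layout are illustrative.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return R², RMSE and MAE for paired observations and predictions."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all observations are equal")
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }
```

RMSE penalises large errors more heavily than MAE, so reporting both, as the study does, gives some indication of whether the error is dominated by a few large misses.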

The paper presents a Machine Learning (ML) approach to household Electrical Energy (EE) consumption prediction. It includes data preprocessing, feature engineering, learning a classification model, and experimental evaluation on one of the largest datasets for household EE consumption, the DataPort dataset. Besides the features extracted from the historical EE consumption, we additionally analyze weather and contextual-calendar related features. We believe that the combination of multiple sources of data (calendar, weather, historical EE consumption) provides more information to the model, allowing it to learn a better-performing model. The experimental results showed that in all cases the ML algorithms outperform the baselines. The best-performing algorithm, XGBoost, achieved an RMSE of 0.69, an MAE of 0.41 and an R² of 0.67, which is significantly better than the best-performing baseline model (the value from 24 h ago). Additionally, the results show that the largest errors are made for the weekends, which was expected due to irregularities in the schedule (trips, vacations, etc.).