Fig 3 - uploaded by Damien François


# Ordered differences between the chosen distance and the optimal one according to the suggested policy, compared with the default Euclidean distance and a random choice.

Source publication

One of the earliest challenges a practitioner faces when using distance-based tools lies in the choice of the distance,
for which there is often very little information to rely on. This chapter proposes to find a compromise between an a priori unoptimized
choice (e.g. the Euclidean distance) and a fully-optimized, but computationally expensive, c...

## Context in source publication

**Context 1**

... dataset. The results for the Friedman dataset are given in Table 2. They show that for most models the Euclidean distance is optimal, although the Manhattan distance performs nearly as well.

Tecator dataset. The results are presented in Table 3. Except for the RBFNOrr model, the Chebyshev distance performs well. It is outperformed by the Euclidean distance with the LSSVM and SVM, but not by much.

Housing dataset. As shown in Table 4, the Euclidean distance performs best in most cases. For some models, the Manhattan distance performs as well as the Euclidean one.

Forest dataset. The results on this dataset are very good in general. All distance definitions provide similar results, except for the Chebyshev distance with the LSSVM and the RBFNOrr.

Hardware dataset. The results are given in Table 6. The Euclidean norm appears to be the optimal choice for all models, though the other distances perform nearly as well. The RBFNOrr model performs much worse than the others; for it, the Euclidean distance is clearly the most relevant one.

Concrete dataset. On this dataset, all models agree that the Euclidean norm gives the best results, although for some models these are not significantly better than those obtained with the 1/2-norm. See Table 7 for detailed results.

Housingburst dataset. For all models, the best results are obtained using the 1/2-norm. Although not all differences may be statistically significant, there seems to be a clear tendency for performance to decrease as the exponent in the distance definition increases (Table 8).

Tecatorburst. Although less convincingly than with the Housingburst dataset, the results tend to show a decrease in performance as the exponent increases. Only the RBFNOrr behaves differently: for this model, the 1/2-norm is the worst-performing one (Table 9).

Delve. For each model except the LSSVM, all distance measures perform equally well.
It is worth noting that for the LSSVM model, both the Manhattan metric and the Chebyshev distance perform better than any other combination of model and distance definition. By contrast, the LSSVM with the Euclidean distance performs worse than all others (Table 10).

Following the suggested policy (cf. Sec. 3.4), the distance functions were chosen for each dataset as follows:

[Table: columns 1/2-norm, 1-norm, 2-norm, ∞-norm; rows Friedman, Tecator, Housing, Forest, Hardware, Concrete, Housingburst, Tecatorburst, Delve]

To better grasp the relevance of those choices, we compute the differences in Normalized Median Squared Error between that choice and the optimal choice over all distance definitions. The larger the difference, the worse the choice. A difference of zero simply means that the optimal choice was made. The differences are computed for all experiments: 9 datasets, 5 models, and 10 repetitions, for a total of 450 experiments. They are then sorted and plotted to obtain a curve resembling a lift curve. The larger the area under the curve, the better the results. Figure 3 shows such a curve for the suggested policy, and also for a simpler policy that would choose the Euclidean distance by default and for a random-choice policy. The proposed policy achieves better results than defaulting to the Euclidean norm. It is worth mentioning that the computational cost of obtaining the results of the nearest-neighbor model was negligible compared to the time needed to optimize a single prediction model.

The choice of an optimal metric is an increasingly important one in modeling, especially when the practitioner is facing complex data.
When no prior information is available to choose the most relevant distance measure, the choice can be optimized by resorting to resampling methods, which adds a layer of complexity to the usual cross-validation loops needed to learn complex models that depend on some meta-parameters; most often, however, it simply defaults to the Euclidean distance. The approach developed in this chapter aims at finding a compromise between both extremes, using ideas similar to those developed in filter methods for feature selection and landmarking approaches to meta-learning. The idea is to assess each candidate distance metric using the performance of a simple nearest-neighbor model prior to building a more elaborate distance-based model (support vector machines with Gaussian kernels, radial basis function networks, and other lazy learning algorithms). Ties are resolved by defaulting to the Euclidean distance. The experiments show that, although this approach does not find the optimal metric for all datasets and all models, it proves to be a reasonable heuristic that provides, at a reasonable cost, hints and information about which distance metric to use. The experiments furthermore show that the choice of the optimal metric is rather constant across distance-based models, with the possible exception of the support vector machine, whose learning algorithm may suffer from not using the Euclidean distance because not all metrics lead to Mercer kernels. In most cases, when the optimal metric was not found using the suggested approach, the results obtained using the suggested metric were close enough to the optimal ones to justify relying on the method.
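
The selection heuristic summarized above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's implementation: the function names, the 1-nearest-neighbor regressor, the leave-one-out validation, the unnormalized median squared error, the tie tolerance, and the toy data are all assumptions made for the example.

```python
import math
from statistics import median

def minkowski(x, y, p):
    """Minkowski-type distance: p < 1 is a fractional 'norm', p = inf is Chebyshev."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def loo_1nn_error(X, y, p):
    """Leave-one-out median squared error of a 1-nearest-neighbor regressor."""
    squared_errors = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Nearest neighbor of sample i among all the other samples.
        j = min((k for k in range(len(X)) if k != i),
                key=lambda k: minkowski(xi, X[k], p))
        squared_errors.append((yi - y[j]) ** 2)
    return median(squared_errors)

def choose_distance(X, y, candidates=(0.5, 1.0, 2.0, math.inf), tol=1e-12):
    """Pick the exponent with the lowest 1-NN error.

    Assumption: a tie that includes p = 2 is resolved in favor of the
    Euclidean distance; otherwise the first tied candidate is kept.
    """
    scores = {p: loo_1nn_error(X, y, p) for p in candidates}
    best = min(scores.values())
    tied = [p for p, s in scores.items() if s - best <= tol]
    chosen = 2.0 if 2.0 in tied else tied[0]
    return chosen, scores

# Toy data, made up for illustration only.
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [0, 1, 1, 2]
chosen, scores = choose_distance(X, y)
print("chosen exponent:", chosen)
print("scores per exponent:", scores)
```

The point of the design is that the nearest-neighbor scores are cheap to obtain compared with tuning a full model for every candidate distance, so the chosen exponent can then be passed to the more elaborate distance-based model.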
Although the approach was tested in this study only for Minkowski and fractional metrics, it could be extended to other choices of metric, and future work will consist in developing a similar approach for the choice of the kernel in kernel-based methods, a problem which is very closely related to the choice of the distance ...
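
The ordered differences plotted in Figure 3 can be computed along the following lines. This is a sketch under stated assumptions: the function names and the error values are made up for illustration (they are not results from the chapter), and `fraction_within` is only one possible way to summarize the sorted gaps.

```python
import random

DISTANCES = ["1/2-norm", "1-norm", "2-norm", "inf-norm"]

def ordered_differences(errors, choices):
    """Sorted gaps between the error of the chosen distance and the best error.

    errors  -- one dict per experiment, mapping distance name -> error
    choices -- the distance name picked for each experiment
    """
    gaps = [exp[choice] - min(exp.values()) for exp, choice in zip(errors, choices)]
    return sorted(gaps)

def fraction_within(gaps, tolerance):
    """Share of experiments whose gap to the optimum is at most `tolerance`."""
    return sum(g <= tolerance for g in gaps) / len(gaps)

# Made-up errors for three experiments (illustration only).
errors = [
    {"1/2-norm": 0.30, "1-norm": 0.22, "2-norm": 0.20, "inf-norm": 0.35},
    {"1/2-norm": 0.10, "1-norm": 0.12, "2-norm": 0.15, "inf-norm": 0.40},
    {"1/2-norm": 0.50, "1-norm": 0.45, "2-norm": 0.45, "inf-norm": 0.44},
]

rng = random.Random(0)  # fixed seed so the random policy is reproducible
policies = {
    "hypothetical policy": ["2-norm", "1/2-norm", "2-norm"],
    "Euclidean default": ["2-norm"] * len(errors),
    "random choice": [rng.choice(DISTANCES) for _ in errors],
}

for name, choices in policies.items():
    gaps = ordered_differences(errors, choices)
    print(name, [round(g, 2) for g in gaps], fraction_within(gaps, 0.0))
```

A gap of zero means the policy picked the best distance for that experiment, so a policy whose sorted gaps stay near zero for more experiments is the better one.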

## Similar publications

This proposal presents an approach to queue management whereby the data collected from a queue management system is used to predict future trends. The aim is to provide insights useful to managers (for decision making), as well as users (for proper planning). Through the use of regression tools and time series analysis, various insights can be est...

It is well understood that the range of application for an empirical ground-motion prediction model is constrained by the range of predictor variables covered in the data used in the analysis. However, in probabilistic seismic hazard analysis (PSHA), the limits in the application of ground-motion prediction models (GMPMs) are often ignored, and the...

The paper presents algorithms for instance selection for regression problems based upon the CNN and ENN solutions known for classification tasks. A comparative experimental study is performed on several datasets using multilayer perceptrons and k-NN algorithms with different parameters and their various combinations as the method the selection is b...

This study exploits three methods, namely the Back-propagation Neural Network (BPNN), Classification and Regression Tree (CART), and Generalized Regression Neural Network (GRNN) in predicting the student's mathematics achievement. The first part of this study utilizes enrolment data to predict the student's mid-semester evaluation result, whereas t...

The strain rate effect of the bond properties between FRP laminate and concrete substrate was studied experimentally in the paper. 57 double-lap shear specimens were tested under direct shear load at strain rates of up to 100,000 με/sec. Test variables included strain rates, concrete strengths and different types of bonding adhesives and FRP compos...

## Citations

... It is a well-established principle in machine learning that understanding the manifold structure of cases in data space can help guide appropriate selection of a classification model and/or geometric features that enable more accurate classification [39,40]. Data-space inter-sample distance measures are fundamental to many machine-learning algorithms such as k-Nearest-Neighbors [41] (k-NN), and in the case of k-NN, the choice of distance measure can be a key determinant of the accuracy of the classifier [42]. ...

Background
We previously reported on CERENKOV, an approach for identifying regulatory single nucleotide polymorphisms (rSNPs) that is based on 246 annotation features. CERENKOV uses the xgboost classifier and is designed to be used to find causal noncoding SNPs in loci identified by genome-wide association studies (GWAS). We reported that CERENKOV has state-of-the-art performance (by two traditional measures and a novel GWAS-oriented measure, AVGRANK) in a comparison to nine other tools for identifying functional noncoding SNPs, using a comprehensive reference SNP set (OSU17, 15,331 SNPs). Given that SNPs are grouped within loci in the reference SNP set and given the importance of the data-space manifold geometry for machine-learning model selection, we hypothesized that within-locus inter-SNP distances would have class-based distributional biases that could be exploited to improve rSNP recognition accuracy. We thus defined an intralocus SNP “radius” as the average data-space distance from a SNP to the other intralocus neighbors, and explored radius likelihoods for five distance measures.
Results
We expanded the set of reference SNPs to 39,083 (the OSU18 set) and extracted CERENKOV SNP feature data. We computed radius empirical likelihoods and likelihood densities for rSNPs and control SNPs, and found significant likelihood differences between rSNPs and control SNPs. We fit parametric models of likelihood distributions for five different distance measures to obtain ten log-likelihood features that we combined with the 248-dimensional CERENKOV feature matrix. On the OSU18 SNP set, we measured the classification accuracy of CERENKOV with and without the new distance-based features, and found that the addition of distance-based features significantly improves rSNP recognition performance as measured by AUPVR, AUROC, and AVGRANK. Along with feature data for the OSU18 set, the software code for extracting the base feature matrix, estimating ten distance-based likelihood ratio features, and scoring candidate causal SNPs, are released as open-source software CERENKOV2.
Conclusions
Accounting for the locus-specific geometry of SNPs in data-space significantly improved the accuracy with which noncoding rSNPs can be computationally identified.
Electronic supplementary material
The online version of this article (10.1186/s12859-019-2637-4) contains supplementary material, which is available to authorized users.

... , y_n) with p ∈ (0, 1). In practical applications, experimental determination of the exact value for p can be recommended (François et al., 2011). As an alternative to sophisticated dissimilarity measures, algorithms reducing the number of features are often employed. ...
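
For orientation, the excerpt above is truncated before the formula it refers to. Assuming it uses the usual Minkowski-type form with a fractional exponent (the symbols x, d_p and the index i are assumptions; only y_n and p appear in the excerpt), the dissimilarity would read:

```latex
d_p(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}, \qquad p \in (0, 1)
```

For p < 1 this quantity does not satisfy the triangle inequality, which is why it is referred to as a dissimilarity measure rather than a metric.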

In this paper, methods for handling high-dimensional weather forecast data used to predict the concentrations of PM10, PM2.5, SO2, NO, CO and O3 are proposed. The procedure employed to predict pollution normally requires historical data samples for a large number of points in time, particularly weather forecast data, actual weather data and pollution data. Likewise, it typically involves numerous features related to atmospheric conditions. Consequently, analyzing such datasets to generate accurate forecasts becomes a very cumbersome task. The paper examines a variety of unsupervised dimensionality reduction methods aimed at obtaining a compact yet informative set of features. As an alternative, an approach using fractional distances for data analysis tasks is considered as well. Both strategies were evaluated on real-world data obtained from the Institute of Meteorology and Water Management in Katowice (Poland), with the extended Air Pollution Forecast Model (e-APFM) used as the underlying prediction tool. It was found that employing a fractional distance as a dissimilarity measure ensures the best forecasting accuracy. Satisfactory results can also be obtained with Isomap, Landmark Isomap and Factor Analysis as dimensionality reduction techniques. These methods can also be used to formulate a universal mapping, ready to use for data gathered in different geographical areas.
Full version: http://authors.elsevier.com/a/1T1pc5c6cKexuw

... Experiments [15] also show that choosing the right fractional norm, as opposed to the Euclidean norm, could significantly improve the effectiveness of standard k-nearest neighbor (kNN) classification in high-dimensional spaces. This observation was more closely investigated by François et al. [18], who follow a supervised approach to infer the optimum ℓp norm using labeled training data. More precisely, the authors use a simple regression model to choose an optimal norm, which is then evaluated on more elaborate regression models. ...

The hubness phenomenon is a recently discovered aspect of the curse of dimensionality. Hub objects have a small distance to an exceptionally large number of data points while anti-hubs lie far from all other data points. A closely related problem is the concentration of distances in high-dimensional spaces. Previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm to avoid the negative effects of distance concentration. However, which exact fractional norm to use is a largely unsolved problem. The contribution of this work is an empirical analysis of the relation of different ℓp norms and hubness. We propose an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification. Our approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness.

... These shifts might result from the long measurement time of the Agilent 4294A, during which the sensor characteristics might change, combined with a higher sensitivity of DF2 to small changes in the sensor signal. We carried out a leave-one-out cross-validation, using a k-nearest-neighbor (kNN, k = 3) Euclidean distance classifier for the different LDAs (Francois et al., 2011). The results are shown in Table 2. ...

For the self-test of semiconductor gas sensors, we combine two multi-signal processes: temperature-cycled operation (TCO) and electrical impedance spectroscopy (EIS). This combination allows one to discriminate between irreversible changes of the sensor, i.e., changes caused by poisoning, as well as changes in the gas atmosphere. To integrate EIS and TCO, impedance spectra should be acquired in a very short time period, in which the sensor can be considered time invariant, i.e., milliseconds or less. For this purpose we developed a Fourier-based high-speed, low-cost impedance spectroscope. It provides a binary excitation signal through an FPGA (field-programmable gate array), which also acquires the data. To determine impedance spectra, it uses the ETFE (empirical transfer function estimate) method, which calculates the impedance by evaluating the Fourier transformations of current and voltage. With this approach an impedance spectrum over the range from 61 kHz to 100 MHz is acquired in ca. 16 μs.
We carried out TCO–EIS measurements with this spectroscope and a commercial impedance analyzer (Agilent 4294A), with a temperature cycle consisting of six equidistant temperature steps between 200 and 450 °C, with lengths of 30 s (200 °C) and 18 s (all others). Discrimination of carbon monoxide (CO) and methane (CH4) is possible by LDA (linear discriminant analysis) using either TCO or EIS data, thus enabling a validation of results by comparison of both methods.

Light food refers to healthy and nutritious food that has the characteristics of low calorie, low fat, and high fiber. Light food has been favored by the public, especially by the young generation in recent years. Moreover, affected by the COVID-19 epidemic, consumers’ awareness of a healthy diet has been improved to a certain extent. As both take-out and in-place orders for light food are growing rapidly, there are massive customer reviews left on the Meituan platform. However, massive, multi-dimensional unstructured data has not yet been fully explored. This research aims to explore the customers’ focal points and sentiment polarity of the overall comments and to investigate whether there exist differences of these two aspects before and after the COVID-19. A total of 6968 light food customer reviews on the Meituan platform were crawled and finally used for data analysis. This research first conducted the fine-grained sentiment analysis and classification of the light food customer reviews via the SnowNLP technique. In addition, LDA topic modeling was used to analyze positive and negative topics of customer reviews. The experimental results were visualized and the research showed that the SnowNLP technique and LDA topic modeling achieve high performance in extracting the customers’ sentiments and focal points, which provides theoretical and data support for light food businesses to improve customer service. This research contributes to the existing research on LDA modeling and light food customer review analysis. Several practical and feasible suggestions are further provided for managers in the light food industry.

Network security has always been facing new challenges. Accurate and convenient detection of intrusions is needed to protect system security. When the system encounters an intrusion, there may be a problem of insufficient early samples for this type of attack, resulting in a low recognition rate. It is necessary to consider whether it can be combined with a suitable intrusion detection method to realize the detection of abnormal data with only a small number of samples. In this paper, we propose an intrusion detection method based on small sample learning, which can process the intrusion behavior information so as to realize the classification of abnormal behaviors when the previous similar samples are insufficient. And ResNet is selected as the classification model to build a deeper network. We gradually increased the number of iterations and the number of small samples in the experiment, and got the performance changes of different models. Compared with CNN, SVM and other algorithms, the intrusion detection method is evaluated by performance indicators such as accuracy rate and false alarm rate. It is finally proved that ResNet can better deal with the intrusion detection classification problem under small sample data. It is more feasible and accurate, and can be widely used to determine network intrusion behavior.

In this paper, we discuss the problem of estimating the minimum error reachable by a regression model given a dataset, prior to learning. More specifically, we extend the Gamma Test estimates of the variance of the noise from the continuous case to the binary case. We give some heuristics for further possible extensions of the theory in the continuous case with the [Formula: see text]-norm and conclude with some applications and simulations. From the point of view of machine learning, the result is relevant because it gives conditions under which there is no need to learn the model in order to predict the best possible performance.

The book focuses on different variants of decision tree induction but also describes the meta-learning approach in general which is applicable to other types of machine learning algorithms. The book discusses different variants of decision tree induction and represents a useful source of information to readers wishing to review some of the techniques used in decision tree learning, as well as different ensemble methods that involve decision trees. It is shown that the knowledge of different components used within decision tree learning needs to be systematized to enable the system to generate and evaluate different variants of machine learning algorithms with the aim of identifying the top-most performers or potentially the best one. A unified view of decision tree learning enables to emulate different decision tree algorithms simply by setting certain parameters. As meta-learning requires running many different processes with the aim of obtaining performance results, a detailed description of the experimental methodology and evaluation framework is provided. Meta-learning is discussed in great detail in the second half of the book. The exposition starts by presenting a comprehensive review of many meta-learning approaches explored in the past described in literature, including for instance approaches that provide a ranking of algorithms. The approach described can be related to other work that exploits planning whose aim is to construct data mining workflows. The book stimulates interchange of ideas between different, albeit related, approaches.

The problems of learning and meta-learning have been introduced formally in Sect. 1.1. Many learning algorithms have been proposed by the CI community to solve miscellaneous problems like classification, approximation, clustering, time series prediction and others.