Fig 2 - uploaded by Damien François
Mean (plain line) and standard deviation (dotted lines around the mean) of the median squared test error of all models for four distance definitions, over 10 runs. Blue: LLNKWW, Green: KKNN, Red: RBFNOrr, Cyan: LSSVM, Pink: SVM
Source publication
One of the earliest challenges a practitioner faces when using distance-based tools lies in the choice of the distance,
for which there is often very little information to rely on. This chapter proposes to find a compromise between an a priori unoptimized
choice (e.g. the Euclidean distance) and a fully optimized, but computationally expensive, c...
Context in source publication
Context 1
... Manhattan distance, Euclidean distance and Chebyshev distance. The boxes represent the inter-quartile range, the horizontal line inside the box is the median, and the tails represent the fifth and 95th percentiles respectively. The plusses represent single outliers. Table 1 provides the mean and standard deviation of those distributions. For Friedman and Hardware, visual inspection of the plots seems to favor the Euclidean distance, at least in terms of median results, although the difference might not be statistically significant. As far as the Tecator dataset is concerned, the Chebyshev metric seems to be the favorite choice. The same conclusion can be drawn for Delve, although it is much less obvious there. For both Housingburst and Tecatorburst, the 1/2-norm seems preferable. The same conclusion can be drawn for Concrete, although it is not as clear as for the latter. The results for Housing seem to point to the Manhattan norm as the most relevant, although it does not significantly outperform the 1/2-norm and the Euclidean norm, while for the Forest dataset all distances seem to perform equally. For those datasets, defaulting to the Euclidean distance also seems reasonable. The results of 10 runs of the random splitting are shown in Figure 2. Most of the time, the results of the different models are comparable. No model outperforms all others over all datasets. Sometimes, however, one model performs worse than the others. This is the case, for instance, with the RBFNOrr model on Tecator and Hardware. For Friedman and Concrete, the KKNN performs very badly with the 1/2-norm, while for Delve, the SVM performs worse with the Euclidean norm than any other model on those data. Note that some models deliver poor performances (around 0.6 and above) with specific distance measures; in those cases, the choice of the correct distance measure is ...
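As an illustration of the kind of comparison described above, the following sketch evaluates a plain k-NN regressor under four Minkowski-type distances (the fractional 1/2-norm, Manhattan, Euclidean and Chebyshev) and reports the mean and standard deviation of the median squared test error over 10 random splits. The dataset (scikit-learn's Friedman benchmark generator), the value of k and the split ratio are illustrative assumptions; they are not the models or settings used in the chapter.

```python
# Hedged sketch: compare distance definitions with a simple k-NN regressor,
# scored by the median squared test error over 10 random splits.
import numpy as np
from sklearn.datasets import make_friedman1            # stand-in dataset (assumption)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

def minkowski_like(p):
    """Return d(x, y) = (sum_i |x_i - y_i|^p)^(1/p); p may be fractional or infinite."""
    if np.isinf(p):
        return lambda x, y: np.max(np.abs(x - y))       # Chebyshev limit
    return lambda x, y: np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def median_squared_error(y_true, y_pred):
    return np.median((y_true - y_pred) ** 2)

X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)
distances = {"1/2-norm": 0.5, "Manhattan": 1.0, "Euclidean": 2.0, "Chebyshev": np.inf}

for name, p in distances.items():
    scores = []
    for run in range(10):                               # 10 random splits, as in the text
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=run)
        knn = KNeighborsRegressor(n_neighbors=5, algorithm="brute", metric=minkowski_like(p))
        knn.fit(X_tr, y_tr)
        scores.append(median_squared_error(y_te, knn.predict(X_te)))
    print(f"{name:>9}: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```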
Similar publications
This proposal presents an approach to queue management whereby the data collected from a queue management system is used to predict future trends. The aim is to provide insights useful to managers (for decision making), as well as users (for proper planning). Through the use of regression tools and time series analysis, various insights can be est...
It is well understood that the range of application for an empirical ground-motion prediction model is constrained by the range of predictor variables covered in the data used in the analysis. However, in probabilistic seismic hazard analysis (PSHA), the limits in the application of ground-motion prediction models (GMPMs) are often ignored, and the...
The paper presents algorithms for instance selection for regression problems based upon the CNN and ENN solutions known for classification tasks. A comparative experimental study is performed on several datasets using multilayer perceptrons and k-NN algorithms with different parameters and their various combinations as the method the selection is b...
This study exploits three methods, namely the Back-propagation Neural Network (BPNN), Classification and Regression Tree (CART), and Generalized Regression Neural Network (GRNN) in predicting the student's mathematics achievement. The first part of this study utilizes enrolment data to predict the student's mid-semester evaluation result, whereas t...
The strain-rate effect on the bond properties between FRP laminate and concrete substrate was studied experimentally in the paper. 57 double-lap shear specimens were tested under direct shear load at strain rates of up to 100,000 με/sec. Test variables included strain rates, concrete strengths and different types of bonding adhesives and FRP compos...
Citations
... It is a well-established principle in machine learning that understanding the manifold structure of cases in data space can help guide the appropriate selection of a classification model and/or geometric features that enable more accurate classification [39,40]. Data-space inter-sample distance measures are fundamental to many machine-learning algorithms such as k-Nearest-Neighbors (k-NN) [41], and in the case of k-NN, the choice of distance measure can be a key determinant of the accuracy of the classifier [42]. ...
Background
We previously reported on CERENKOV, an approach for identifying regulatory single nucleotide polymorphisms (rSNPs) that is based on 246 annotation features. CERENKOV uses the xgboost classifier and is designed to be used to find causal noncoding SNPs in loci identified by genome-wide association studies (GWAS). We reported that CERENKOV has state-of-the-art performance (by two traditional measures and a novel GWAS-oriented measure, AVGRANK) in a comparison to nine other tools for identifying functional noncoding SNPs, using a comprehensive reference SNP set (OSU17, 15,331 SNPs). Given that SNPs are grouped within loci in the reference SNP set and given the importance of the data-space manifold geometry for machine-learning model selection, we hypothesized that within-locus inter-SNP distances would have class-based distributional biases that could be exploited to improve rSNP recognition accuracy. We thus defined an intralocus SNP “radius” as the average data-space distance from a SNP to the other intralocus neighbors, and explored radius likelihoods for five distance measures.
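A minimal sketch of how such an intralocus radius could be computed is given below, assuming the SNP feature matrix and a vector of locus identifiers are available as NumPy arrays and using SciPy's cdist for the distance measure; the actual CERENKOV2 implementation may differ.

```python
# Hedged sketch: per-SNP intralocus "radius" = mean data-space distance
# from a SNP to the other SNPs in the same locus.
import numpy as np
from scipy.spatial.distance import cdist

def intralocus_radii(X, locus_ids, metric="euclidean"):
    """Return, for each SNP, the average distance to its intralocus neighbors.
    SNPs that are alone in their locus get NaN."""
    radii = np.full(len(X), np.nan)
    for locus in np.unique(locus_ids):
        idx = np.where(locus_ids == locus)[0]
        if len(idx) < 2:
            continue
        D = cdist(X[idx], X[idx], metric=metric)        # pairwise intralocus distances
        radii[idx] = D.sum(axis=1) / (len(idx) - 1)     # exclude the zero self-distance
    return radii

# Hypothetical usage (names are assumptions): X is the SNP feature matrix,
# locus_ids assigns each SNP to its GWAS locus, and the metric can be varied
# across the distance measures considered.
# radii = intralocus_radii(X, locus_ids, metric="cityblock")
```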
Results
We expanded the set of reference SNPs to 39,083 (the OSU18 set) and extracted CERENKOV SNP feature data. We computed radius empirical likelihoods and likelihood densities for rSNPs and control SNPs, and found significant likelihood differences between rSNPs and control SNPs. We fit parametric models of likelihood distributions for five different distance measures to obtain ten log-likelihood features that we combined with the 248-dimensional CERENKOV feature matrix. On the OSU18 SNP set, we measured the classification accuracy of CERENKOV with and without the new distance-based features, and found that the addition of distance-based features significantly improves rSNP recognition performance as measured by AUPVR, AUROC, and AVGRANK. Along with feature data for the OSU18 set, the software code for extracting the base feature matrix, estimating ten distance-based likelihood ratio features, and scoring candidate causal SNPs is released as open-source software CERENKOV2.
Conclusions
Accounting for the locus-specific geometry of SNPs in data-space significantly improved the accuracy with which noncoding rSNPs can be computationally identified.
Electronic supplementary material
The online version of this article (10.1186/s12859-019-2637-4) contains supplementary material, which is available to authorized users.
... , y_n) with p ∈ (0, 1). In practical applications, experimental determination of the exact value of p can be recommended (François et al., 2011). As an alternative to sophisticated dissimilarity measures, algorithms that reduce the number of features are often employed. ...
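For concreteness, the fractional dissimilarity referred to above can be written as d_p(x, y) = (Σ_i |x_i − y_i|^p)^(1/p) with p ∈ (0, 1); a short sketch follows, in which the grid of candidate p values and the example vectors are purely illustrative.

```python
# Hedged sketch of a fractional Minkowski-style dissimilarity with p in (0, 1).
import numpy as np

def fractional_distance(x, y, p=0.5):
    """d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p). For p < 1 this is not a true
    metric (the triangle inequality fails), but it can still be used as a
    dissimilarity for nearest-neighbor style methods."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Illustrative scan over candidate exponents; in practice p would be chosen
# experimentally on the task at hand, in the spirit of the cited recommendation.
for p in (0.25, 0.5, 0.75, 1.0):
    d = fractional_distance([1.0, 2.0, 3.0], [2.0, 0.0, 3.5], p)
    print(f"p = {p:.2f}: d = {d:.3f}")
```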
The paper proposes methods for handling high-dimensional weather forecast data used to predict the concentrations of PM10, PM2.5, SO2, NO, CO and O3. The procedure employed to predict pollution normally requires historical data samples for a large number of points in time – in particular weather forecast data, actual weather data and pollution data. Likewise, it typically involves using numerous features related to atmospheric conditions. Consequently, the analysis of such datasets to generate accurate forecasts becomes a very cumbersome task. The paper examines a variety of unsupervised dimensionality reduction methods aimed at obtaining a compact yet informative set of features. As an alternative, an approach using fractional distances for data analysis tasks is also considered. Both strategies were evaluated on real-world data obtained from the Institute of Meteorology and Water Management in Katowice (Poland), with the extended Air Pollution Forecast Model (e-APFM) used as the underlying prediction tool. It was found that employing a fractional distance as the dissimilarity measure yields the best forecasting accuracy. Satisfactory results can also be obtained with Isomap, Landmark Isomap and Factor Analysis as dimensionality reduction techniques. These methods can also be used to formulate a universal mapping, ready to use for data gathered in different geographical areas.
Full version: http://authors.elsevier.com/a/1T1pc5c6cKexuw
... Experiments [15] also show that choosing the right fractional norm, as opposed to the Euclidean norm, could significantly improve the effectiveness of standard k-nearest neighbor (kNN) classification in high-dimensional spaces. This observation was more closely investigated by François et al. [18], who follow a supervised approach to infer the optimum ℓp norm using labeled training data. More precisely, the authors use a simple regression model to choose an optimal norm, which is then evaluated on more elaborate regression models. ...
The hubness phenomenon is a recently discovered aspect of the curse of dimensionality. Hub objects have a small distance to an exceptionally large number of data points while anti-hubs lie far from all other data points. A closely related problem is the concentration of distances in high-dimensional spaces. Previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm to avoid the negative effects of distance concentration. However, which exact fractional norm to use is a largely unsolved problem. The contribution of this work is an empirical analysis of the relation of different ℓp norms and hubness. We propose an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification. Our approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness.
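One common way to quantify hubness, in line with the description above, is the skewness of the k-occurrence distribution (how often each point appears among the k nearest neighbors of the other points). The sketch below compares this skewness across several ℓp norms; the random data, the value of k and the candidate p values are illustrative assumptions and not the setup of the cited paper.

```python
# Hedged sketch: skewness of the k-occurrence distribution under different ℓp norms.
import numpy as np
from scipy.stats import skew

def pairwise_lp(X, p):
    """Pairwise ℓp (possibly fractional) dissimilarity matrix, computed directly."""
    diff = np.abs(X[:, None, :] - X[None, :, :]) ** p
    return diff.sum(axis=-1) ** (1.0 / p)

def k_occurrence_skewness(X, k=10, p=2.0):
    """Large positive skewness of the k-occurrence counts indicates pronounced hubs."""
    D = pairwise_lp(X, p)
    np.fill_diagonal(D, np.inf)                      # exclude self-neighbors
    knn = np.argsort(D, axis=1)[:, :k]               # k nearest neighbors of each point
    counts = np.bincount(knn.ravel(), minlength=len(X))
    return skew(counts)

# Illustrative run on random high-dimensional data (not the paper's data sets):
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 50))
for p in (0.5, 1.0, 2.0):
    print(f"p = {p}: k-occurrence skewness = {k_occurrence_skewness(X, p=p):.2f}")
```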
... These shifts might result from the long measurement time of the Agilent 4294A, during which the sensor characteristics might change, combined with a higher sensitivity of DF2 to small changes in the sensor signal. We carried out a leave-one-out cross-validation, using a k-nearest-neighbor (kNN, k = 3) Euclidean distance classifier for the different LDAs (François et al., 2011). The results are shown in Table 2. ...
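A minimal sketch of the leave-one-out evaluation described above, assuming the LDA-projected features and gas labels are available as arrays; scikit-learn's LeaveOneOut and KNeighborsClassifier are used here for illustration and are not necessarily the tooling used by the authors.

```python
# Hedged sketch: leave-one-out cross-validation of a kNN (k = 3) Euclidean classifier.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

def loo_knn_accuracy(X, y, k=3):
    """Leave-one-out accuracy of a k-nearest-neighbor classifier (Euclidean distance)."""
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = KNeighborsClassifier(n_neighbors=k, metric="euclidean")
        clf.fit(X[train_idx], y[train_idx])
        correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)

# Hypothetical usage (names are assumptions): X_lda holds the LDA-projected
# sensor features, y the gas/condition labels.
# print(f"LOO accuracy: {loo_knn_accuracy(X_lda, y, k=3):.2%}")
```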
For the self-test of semiconductor gas sensors, we combine two multi-signal processes: temperature-cycled operation (TCO) and electrical impedance spectroscopy (EIS). This combination allows one to discriminate between irreversible changes of the sensor, i.e., changes caused by poisoning, and changes in the gas atmosphere. To integrate EIS and TCO, impedance spectra should be acquired in a very short time period, in which the sensor can be considered time invariant, i.e., milliseconds or less. For this purpose we developed a Fourier-based high-speed, low-cost impedance spectroscope. It provides a binary excitation signal through an FPGA (field-programmable gate array), which also acquires the data. To determine impedance spectra, it uses the ETFE (empirical transfer function estimate) method, which calculates the impedance by evaluating the Fourier transformations of current and voltage. With this approach an impedance spectrum over the range from 61 kHz to 100 MHz is acquired in ca. 16 μs.
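In spirit, the ETFE estimate mentioned above divides the Fourier transform of the sampled voltage by that of the sampled current at each frequency bin; a hedged sketch follows, in which the function name, the sampling rate and the usage comments are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch: empirical transfer function estimate (ETFE) of impedance, Z(f) = V(f) / I(f).
import numpy as np

def etfe_impedance(voltage, current, fs):
    """Estimate the impedance spectrum from simultaneously sampled voltage and
    current records by dividing their FFTs bin by bin."""
    V = np.fft.rfft(voltage)
    I = np.fft.rfft(current)
    freqs = np.fft.rfftfreq(len(voltage), d=1.0 / fs)
    return freqs[1:], V[1:] / I[1:]                   # drop the DC bin

# Hypothetical usage: v and i are the acquired waveforms under the binary
# excitation and fs is the acquisition sampling rate (values are assumptions).
# f, Z = etfe_impedance(v, i, fs=200e6)
# print(np.abs(Z[:5]), np.degrees(np.angle(Z[:5])))
```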
We carried out TCO–EIS measurements with this spectroscope and a commercial impedance analyzer (Agilent 4294A), with a temperature cycle consisting of six equidistant temperature steps between 200 and 450 °C, with lengths of 30 s (200 °C) and 18 s (all others). Discrimination of carbon monoxide (CO) and methane (CH4) is possible by LDA (linear discriminant analysis) using either TCO or EIS data, thus enabling a validation of results by comparison of both methods.
Light food refers to healthy and nutritious food that is low in calories and fat and high in fiber. Light food has been favored by the public, especially by the younger generation, in recent years. Moreover, affected by the COVID-19 epidemic, consumers’ awareness of a healthy diet has improved to a certain extent. As both take-out and in-place orders for light food are growing rapidly, there are massive customer reviews left on the Meituan platform. However, this massive, multi-dimensional unstructured data has not yet been fully explored. This research aims to explore the customers’ focal points and sentiment polarity of the overall comments and to investigate whether these two aspects differ before and after the COVID-19 outbreak. A total of 6968 light food customer reviews on the Meituan platform were crawled and finally used for data analysis. This research first conducted fine-grained sentiment analysis and classification of the light food customer reviews via the SnowNLP technique. In addition, LDA topic modeling was used to analyze positive and negative topics of customer reviews. The experimental results were visualized, and the research showed that the SnowNLP technique and LDA topic modeling achieve high performance in extracting the customers’ sentiments and focal points, which provides theoretical and data support for light food businesses to improve customer service. This research contributes to the existing research on LDA modeling and light food customer review analysis. Several practical and feasible suggestions are further provided for managers in the light food industry.
Network security has always faced new challenges. Accurate and convenient detection of intrusions is needed to protect system security. When the system encounters an intrusion, there may be insufficient early samples of this type of attack, resulting in a low recognition rate. It is therefore necessary to consider whether a suitable intrusion detection method can detect abnormal data with only a small number of samples. In this paper, we propose an intrusion detection method based on small-sample learning, which can process intrusion behavior information so as to classify abnormal behaviors when previous similar samples are insufficient. ResNet is selected as the classification model to build a deeper network. We gradually increased the number of iterations and the number of small samples in the experiments and recorded the performance changes of the different models. Compared with CNN, SVM and other algorithms, the intrusion detection method is evaluated by performance indicators such as accuracy rate and false alarm rate. It is finally shown that ResNet can better handle the intrusion detection classification problem under small-sample data. It is more feasible and accurate, and can be widely used to determine network intrusion behavior.
In this paper, we discuss the problem of estimating the minimum error reachable by a regression model given a dataset, prior to learning. More specifically, we extend the Gamma Test estimates of the variance of the noise from the continuous case to the binary case. We give some heuristics for further possible extensions of the theory in the continuous case with the [Formula: see text]-norm and conclude with some applications and simulations. From the point of view of machine learning, the result is relevant because it gives conditions under which there is no need to learn the model in order to predict the best possible performance.
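For readers unfamiliar with the Gamma Test mentioned above, a hedged sketch of its usual continuous-case form follows: for each k, δ(k) is the mean squared distance from each point to its k-th nearest neighbour and γ(k) is half the mean squared difference of the corresponding outputs, and the intercept of the regression line of γ on δ estimates the noise variance. The synthetic data and parameter choices below are illustrative only.

```python
# Hedged sketch of the standard (continuous-case) Gamma Test noise-variance estimate.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def gamma_test(X, y, k_max=10):
    """Estimate Var(noise) in y = f(X) + noise: regress gamma(k) on delta(k)
    for k = 1..k_max and return the intercept of the fitted line."""
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(X)
    dist, idx = nn.kneighbors(X)                      # column 0 is each point itself
    deltas = np.mean(dist[:, 1:] ** 2, axis=0)        # delta(k): mean sq. neighbor distance
    gammas = np.array([np.mean((y[idx[:, k]] - y) ** 2) / 2.0
                       for k in range(1, k_max + 1)]) # gamma(k): half mean sq. output gap
    return np.polyfit(deltas, gammas, 1)[1]           # intercept approximates the noise variance

# Illustrative check on synthetic data whose true noise variance is 0.25 (assumption):
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 3))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.5, size=1000)
print(f"Gamma Test estimate of the noise variance: {gamma_test(X, y):.3f}")
```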
The book focuses on different variants of decision tree induction but also describes the meta-learning approach in general, which is applicable to other types of machine learning algorithms. The book discusses different variants of decision tree induction and represents a useful source of information for readers wishing to review some of the techniques used in decision tree learning, as well as different ensemble methods that involve decision trees. It is shown that the knowledge of the different components used within decision tree learning needs to be systematized to enable the system to generate and evaluate different variants of machine learning algorithms with the aim of identifying the top-most performers or potentially the best one. A unified view of decision tree learning makes it possible to emulate different decision tree algorithms simply by setting certain parameters. As meta-learning requires running many different processes with the aim of obtaining performance results, a detailed description of the experimental methodology and evaluation framework is provided. Meta-learning is discussed in great detail in the second half of the book. The exposition starts by presenting a comprehensive review of many meta-learning approaches explored in the past and described in the literature, including for instance approaches that provide a ranking of algorithms. The approach described can be related to other work that exploits planning, whose aim is to construct data mining workflows. The book stimulates the interchange of ideas between different, albeit related, approaches.
The problems of learning and meta-learning have been introduced formally in Sect. 1.1. Many learning algorithms have been proposed by the CI community to solve miscellaneous problems like classification, approximation, clustering, time series prediction and others.