# Ordered differences between the chosen distance and the optimal one according to the suggested policy, compared with the default Euclidean distance and a random choice.

One of the earliest challenges a practitioner faces when using distance-based tools lies in the choice of the distance,
for which there is often very little information to rely on. This chapter proposes to find a compromise between an a priori unoptimized
choice (e.g. the Euclidean distance) and a fully-optimized, but computationally expensive, c...

... dataset. The results for the Friedman dataset are given in Table 2. They show that, for most models, the Euclidean distance is the optimal one, although the Manhattan distance performs nearly as well.

Tecator dataset. The results are presented in Table 3. Except for the RBFNOrr model, the Chebyshev distance performs well. It is outperformed by the Euclidean distance with the LSSVM and SVM, but not by far.

Housing dataset. As shown in Table 4, the Euclidean distance performs best in most cases. For some models, the Manhattan distance performs as well as the Euclidean one.

Forest dataset. The results on this dataset are very good in general. All distance definitions provide similar results, except for the Chebyshev distance with the LSSVM and the RBFNOrr.

Hardware dataset. The results are given in Table 6. The Euclidean norm appears to be the optimal choice for all models, although the other distances perform nearly as well. The RBFNOrr model performs markedly worse than the others; for it, the Euclidean distance is clearly the most relevant one.

Concrete dataset. On this dataset, all models agree that the Euclidean norm gives the best results, although for some models these are not significantly better than those obtained with the 1/2-norm. See Table 7 for detailed results.

Housingburst dataset. For all models, the best results are obtained using the 1/2-norm. Although not all differences may be statistically significant, there seems to be a clear tendency for the results to degrade as the exponent in the distance definition increases (Table 8).

Tecatorburst. Although less convincingly than with the Housingburst dataset, the results tend to show a decrease in performance as the exponent increases. Only the RBFNOrr behaves differently: for this model, the 1/2-norm is the worst-performing one (Table 9).

Delve. For each model except the LSSVM, all distance measures perform equally well.
It is worth noting that for the LSSVM model, both the Manhattan metric and the Chebyshev distance perform better than any other combination of model and distance definition. By contrast, the LSSVM with the Euclidean distance performs worse than all others (Table 10).

Following the suggested policy (cf. Sec. 3.4), the distance functions were chosen for each dataset as follows:

| Dataset | 1/2-norm | 1-norm | 2-norm | ∞-norm |
|---|---|---|---|---|
| Friedman | | | | |
| Tecator | | | | |
| Housing | | | | |
| Forest | | | | |
| Hardware | | | | |
| Concrete | | | | |
| Housingburst | | | | |
| Tecatorburst | | | | |
| Delve | | | | |

To better grasp the relevance of those choices, we compute the differences in Normalized Median Squared Error between that choice and the optimal choice over all distance definitions. The larger the difference, the worse the choice. A difference of zero simply means that the optimal choice was made. The differences are computed for all experiments: 9 datasets, 5 models, 10 repetitions, for a total of 450 experiments. They are then sorted and plotted, to obtain a curve resembling a lift curve. The larger the area under the curve, the better the results. Figure 3 shows such a curve for the suggested policy, as well as for a simpler policy that chooses the Euclidean distance by default and for a random-choice policy. The proposed policy achieves better results than defaulting to the Euclidean norm. It is worth mentioning that the computational cost of obtaining the results of the nearest-neighbor model was negligible compared to the time needed to optimize a single prediction model.

The choice of an optimal metric is increasingly important in modeling, especially when the practitioner is facing complex data.
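The comparison of policies described in the experiments (differences between the error of the chosen distance and the error of the optimal distance, sorted over all experiments) can be sketched as follows. This is only an illustrative sketch: the function name `sorted_differences`, the array layout, and the toy numbers are assumptions made here and are not taken from the chapter, and the exact plotting convention of Figure 3 is not reproduced.

```python
import numpy as np

def sorted_differences(errors, chosen):
    """Sorted differences between the chosen and the optimal distance.

    errors: array of shape (n_experiments, n_distances) holding an error
            measure where lower is better (assumed layout, one column per
            candidate distance).
    chosen: for each experiment, the column index picked by a policy.
    """
    errors = np.asarray(errors, dtype=float)
    chosen = np.asarray(chosen, dtype=int)
    optimal = errors.min(axis=1)                       # best error per experiment
    picked = errors[np.arange(errors.shape[0]), chosen]
    return np.sort(picked - optimal)                   # zero means optimal choice

# Made-up toy errors for 3 experiments and 4 candidate distances
# (columns: 1/2-norm, 1-norm, 2-norm, inf-norm). Not results from the chapter.
toy_errors = np.array([[0.30, 0.20, 0.25, 0.40],
                       [0.10, 0.10, 0.10, 0.50],
                       [0.60, 0.50, 0.40, 0.45]])

rng = np.random.default_rng(0)
policies = {
    "hypothetical policy picks": np.array([1, 2, 3]),
    "Euclidean by default": np.full(3, 2),
    "random choice": rng.integers(0, 4, size=3),
}
for name, chosen in policies.items():
    diffs = sorted_differences(toy_errors, chosen)
    share_optimal = np.mean(diffs == 0)
    print(f"{name}: differences={diffs}, optimal in {share_optimal:.0%} of experiments")
```

Plotting the returned vector for each policy on the same axes gives one curve per policy, which is the kind of side-by-side comparison the text describes.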
When no prior information is available to choose the most relevant distance measure, the choice can be optimized by resorting to resampling methods, which adds a layer of complexity to the usual cross-validation loops needed to learn complex models that depend on some meta-parameters; most often, however, the choice simply defaults to the Euclidean distance. The approach developed in this chapter aims at finding a compromise between both extremes, using ideas similar to those developed in filter methods for feature selection and in landmarking approaches to meta-learning. The idea is to assess each candidate distance metric using the performance of a simple nearest-neighbor model prior to building a more elaborate distance-based model (support vector machines with Gaussian kernels, radial basis function networks, and other lazy learning algorithms). Ties are resolved by defaulting to the Euclidean distance.

The experiments show that, although this approach does not find the optimal metric for all datasets and all models, it proves to be a reasonable heuristic that provides, at a reasonable cost, hints about which distance metric to use. The experiments furthermore show that the choice of the optimal metric is rather constant across distance-based models, with the possible exception of the support vector machine, whose learning algorithm may suffer when a non-Euclidean distance is used because not all metrics lead to Mercer kernels. In most cases where the optimal metric was not found using the suggested approach, the results obtained with the suggested metric were close enough to the optimal ones to justify relying on the method.
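The selection heuristic summarized above (score each candidate distance with a simple nearest-neighbor model, then fall back to the Euclidean distance on ties) might look like the following minimal sketch. Several details are assumptions made for illustration and are not confirmed by the text: the use of a leave-one-out 1-nearest-neighbor regressor, the normalization of the median squared error by the target variance, the `tol` parameter, and all function names.

```python
import numpy as np

# Candidate exponents p of the Minkowski-type distance
# d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p); p = inf gives the Chebyshev distance.
CANDIDATES = {"1/2-norm": 0.5, "1-norm": 1.0, "2-norm": 2.0, "inf-norm": np.inf}

def minkowski_matrix(X, p):
    """Pairwise distance matrix between the rows of X for exponent p."""
    X = np.asarray(X, dtype=float)
    diff = np.abs(X[:, None, :] - X[None, :, :])
    if np.isinf(p):
        return diff.max(axis=2)
    return (diff ** p).sum(axis=2) ** (1.0 / p)

def loo_1nn_error(X, y, p):
    """Leave-one-out 1-NN error: median squared error divided by var(y).

    The normalization by the target variance is an assumption of this sketch.
    """
    y = np.asarray(y, dtype=float)
    D = minkowski_matrix(X, p)
    np.fill_diagonal(D, np.inf)      # a point may not be its own neighbor
    nearest = D.argmin(axis=1)
    return float(np.median((y[nearest] - y) ** 2) / np.var(y))

def choose_distance(X, y, tol=0.0):
    """Pick the best-scoring candidate; default to the Euclidean distance on ties."""
    scores = {name: loo_1nn_error(X, y, p) for name, p in CANDIDATES.items()}
    best = min(scores.values())
    tied = [name for name, s in scores.items() if s - best <= tol]
    choice = tied[0] if len(tied) == 1 else "2-norm"
    return choice, scores

# Toy usage on synthetic data (not one of the datasets from the chapter).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
y_demo = X_demo[:, 0] + 0.1 * rng.normal(size=60)
choice_demo, scores_demo = choose_distance(X_demo, y_demo)
print(choice_demo, scores_demo)
```

Because only a nearest-neighbor search is involved, the cost of scoring all candidates stays small compared to tuning a full prediction model for each distance, which is the trade-off the heuristic relies on.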
Although the approach was tested in this study only for Minkowski and fractional metrics, it could be extended to other choices of metric. Future work will consist in developing a similar approach for the choice of the kernel in kernel-based methods, a problem which is very closely related to the choice of the distance ...

