On the Surprising Behavior of Distance Metrics in High Dimensional Space

Charu C. Aggarwal (1), Alexander Hinneburg (2), and Daniel A. Keim (2)

(1) IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA. charu@watson.ibm.com
(2) Institute of Computer Science, University of Halle, Kurt-Mothes-Str. 1, 06120 Halle (Saale), Germany. {hinneburg, keim}@informatik.uni-halle.de
Abstract. In recent years, the effect of the curse of high dimensionality has been studied in great detail on several problems such as clustering, nearest neighbor search, and indexing. In high dimensional space the data becomes sparse, and traditional indexing and algorithmic techniques fail from an efficiency and/or effectiveness perspective. Recent research results show that in high dimensional space, the concept of proximity, distance or nearest neighbor may not even be qualitatively meaningful. In this paper, we view the dimensionality curse from the point of view of the distance metrics which are used to measure the similarity between objects. We specifically examine the behavior of the commonly used $L_k$ norm and show that the problem of meaningfulness in high dimensionality is sensitive to the value of $k$. For example, this means that the Manhattan distance metric ($L_1$ norm) is consistently more preferable than the Euclidean distance metric ($L_2$ norm) for high dimensional data mining applications. Using the intuition derived from our analysis, we introduce and examine a natural extension of the $L_k$ norm to fractional distance metrics. We show that the fractional distance metric provides more meaningful results both from the theoretical and empirical perspective. The results show that fractional distance metrics can significantly improve the effectiveness of standard clustering algorithms such as the k-means algorithm.
1 Introduction
In recent years, high dimensional search and retrieval have become very well
studied problems because of the increased importance of data mining applica-
tions [1], [2], [3], [4], [5], [8], [10], [11]. Typically, most real applications which
require the use of such techniques comprise very high dimensional data. For such
applications, the curse of high dimensionality tends to be a major obstacle in the
development of data mining techniques in several ways. For example, the per-
formance of similarity indexing structures in high dimensions degrades rapidly,
so that each query requires the access of almost all the data [1].
It has been argued in [6] that under certain reasonable assumptions on the data distribution, the ratio of the distances of the nearest and farthest neighbors to a given target in high dimensional space is almost 1 for a wide variety of data distributions and distance functions. In such a case, the nearest neighbor problem becomes ill defined, since the contrast between the distances to different data points does not exist. In such cases, even the concept of proximity may not be meaningful from a qualitative perspective: a problem which is even more fundamental than the performance degradation of high dimensional algorithms.
In most high dimensional applications the choice of the distance metric is not obvious, and the notion used for the calculation of similarity is largely heuristic. Given the non-contrasting nature of the distribution of distances to a given query point, different measures may provide very different orders of proximity of points to a given query point. There is very little literature providing guidance for choosing the correct distance measure which results in the most meaningful notion of proximity between two records. Many high dimensional indexing structures and algorithms use the Euclidean distance metric as a natural extension of its traditional use in two- or three-dimensional spatial applications.
In this paper, we discuss the general behavior of the commonly used $L_k$ norm ($x, y \in \mathbb{R}^d$, $k \in \mathbb{Z}$, $L_k(x, y) = \left(\sum_{i=1}^{d} \|x_i - y_i\|^k\right)^{1/k}$) in high dimensional space.
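As a concrete illustration (ours, not part of the original paper), the sketch below computes such distances for an arbitrary positive exponent, covering the integral $L_k$ norms as well as the fractional metrics discussed later in Section 3. All function and variable names are our own.

```python
import numpy as np

def minkowski_distance(x: np.ndarray, y: np.ndarray, p: float) -> float:
    """(sum_i |x_i - y_i|^p)^(1/p); p < 1 gives a fractional (quasi-)metric."""
    if p <= 0:
        raise ValueError("exponent p must be positive")
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

# Example: the same pair of points under different norm parameters.
rng = np.random.default_rng(0)
x, y = rng.uniform(size=20), rng.uniform(size=20)
for p in (0.3, 0.5, 1.0, 2.0, 3.0):
    print(f"L_{p}: {minkowski_distance(x, y, p):.4f}")
```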
The $L_k$ norm distance function is also susceptible to the dimensionality curse for many classes of data distributions [6]. Our recent results [9] seem to suggest that the $L_k$-norm may be more relevant for $k = 1$ or $2$ than for values of $k \geq 3$. In this paper, we provide some surprising theoretical and experimental results in analyzing the dependency of the $L_k$ norm on the value of $k$. More specifically, we show that the relative contrasts of the distances to a query point depend heavily on the $L_k$ metric used. This provides considerable evidence that the meaningfulness of the $L_k$ norm worsens faster with increasing dimensionality for higher values of $k$. Thus, for a given problem with a fixed (high) value of the dimensionality $d$, it may be preferable to use lower values of $k$. This means that the $L_1$ distance metric (Manhattan distance metric) is the most preferable for high dimensional applications, followed by the Euclidean metric ($L_2$), then the $L_3$ metric, and so on. Encouraged by this trend, we examine the behavior of fractional distance metrics, in which $k$ is allowed to be a fraction smaller than 1. We show that this metric is even more effective at preserving the meaningfulness of proximity measures. We back up our theoretical results with empirical tests on real and synthetic data showing that the results provided by fractional distance metrics are indeed practically useful. Thus, the results of this paper have strong implications for the choice of distance metrics for high dimensional data mining problems. We specifically show the improvements which can be obtained by applying fractional distance metrics to the standard k-means algorithm.
This paper is organized as follows. In the next section, we provide a theoretical analysis of the behavior of the $L_k$ norm in very high dimensionality. In Section 3, we discuss fractional distance metrics and provide a theoretical analysis of their behavior. In Section 4, we provide the empirical results, and Section 5 provides a summary and conclusions.
2 Behavior of the $L_k$-norm in High Dimensionality

In order to present our convergence results, we first establish some notations and definitions in Table 1.

Table 1. Notations and Basic Definitions

  $d$                               Dimensionality of the data space
  $N$                               Number of data points
  $F$                               1-dimensional data distribution in $(0, 1)$
  $X_d$                             Data point from $F^d$ with each coordinate drawn from $F$
  $dist_d^k(x, y)$                  Distance between $(x_1, \ldots, x_d)$ and $(y_1, \ldots, y_d)$ using the $L_k$ metric,
                                    $dist_d^k(x, y) = \left(\sum_{i=1}^{d} |x_i - y_i|^k\right)^{1/k}$
  $\|\cdot\|_k$                     Distance of a vector to the origin $(0, \ldots, 0)$ using the function $dist_d^k(\cdot, \cdot)$
  $Dmax_d^k = \max\{\|X_d\|_k\}$    Farthest distance of the $N$ points to the origin using the distance metric $L_k$
  $Dmin_d^k = \min\{\|X_d\|_k\}$    Nearest distance of the $N$ points to the origin using the distance metric $L_k$
  $E[X]$, $var[X]$                  Expected value and variance of a random variable $X$
  $Y_d \to_p c$                     A vector sequence $Y_1, \ldots, Y_d$ converges in probability to a constant vector $c$ if
                                    $\forall \epsilon > 0: \lim_{d \to \infty} P[dist_d(Y_d, c) \leq \epsilon] = 1$
Theorem 1 (Beyer et al., adapted for the $L_k$ metric). If $\lim_{d \to \infty} \mathrm{var}\left(\frac{\|X_d\|_k}{E[\|X_d\|_k]}\right) = 0$, then $\frac{Dmax_d^k - Dmin_d^k}{Dmin_d^k} \to_p 0$.

Proof. See [6] for the proof of a more general version of this result.
The result of the theorem [6] shows that the dierence between the maxi-
mum and minimum distances to a given query point
1
does not increase as fast
as the nearest distance to any point in high dimensional space. This makes a
proximity query meaningless and unstable because there is p o or discrimination
between the nearest and furthest neighbor. Henceforth, we will refer to the ratio
Dmax
k
d
?
Dmin
k
d
Dmin
k
d
as
the relative contrast
.
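To make the relative contrast concrete, the following sketch (ours rather than the authors') estimates it by Monte Carlo for uniformly distributed points and several norm parameters; the origin serves as the query point, as in the paper's analysis. The names and the specific parameter values are illustrative only.

```python
import numpy as np

def relative_contrast(n_points: int, d: int, p: float, rng) -> float:
    """Estimate (Dmax - Dmin) / Dmin for N uniform points in (0,1)^d under the L_p norm."""
    X = rng.uniform(size=(n_points, d))
    dists = np.sum(X ** p, axis=1) ** (1.0 / p)   # distances to the origin
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(42)
for d in (2, 20, 100, 500):
    for p in (0.5, 1.0, 2.0, 10.0):
        rc = np.mean([relative_contrast(100, d, p, rng) for _ in range(50)])
        print(f"d={d:4d}  p={p:4.1f}  relative contrast ~ {rc:.3f}")
```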
The results in [6] use the value of $\frac{Dmax_d^k - Dmin_d^k}{Dmin_d^k}$ as an interesting criterion for meaningfulness. In order to provide more insight, in the following we analyze the behavior for different distance metrics in high-dimensional space. We first assume a uniform distribution of data points and show our results for $N = 2$ points. Then, we generalize the results to an arbitrary number of points and arbitrary distributions.

Footnote 1: In this paper, we consistently use the origin as the query point. This choice does not affect the generality of our results, though it simplifies our algebra considerably.
Lemma 1. Let $F$ be the uniform distribution of $N = 2$ points. For an $L_k$ metric, $\lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}}\right] = C \cdot \frac{1}{(k+1)^{1/k}} \sqrt{\frac{1}{2k+1}}$, where $C$ is some constant.
Proof. Let $A_d$ and $B_d$ be the two points in a $d$-dimensional data distribution such that each coordinate is independently drawn from a 1-dimensional data distribution $F$ with finite mean and standard deviation. Specifically $A_d = (P_1 \ldots P_d)$ and $B_d = (Q_1 \ldots Q_d)$ with $P_i$ and $Q_i$ being drawn from $F$. Let $PA_d = \{\sum_{i=1}^{d} (P_i)^k\}^{1/k}$ be the distance of $A_d$ to the origin using the $L_k$ metric and $PB_d = \{\sum_{i=1}^{d} (Q_i)^k\}^{1/k}$ the distance of $B_d$. The difference of distances is $PA_d - PB_d = \{\sum_{i=1}^{d} (P_i)^k\}^{1/k} - \{\sum_{i=1}^{d} (Q_i)^k\}^{1/k}$.
It can be shown (footnote 2) that the random variable $(P_i)^k$ has mean $\frac{1}{k+1}$ and standard deviation $\frac{k}{k+1}\sqrt{\frac{1}{2k+1}}$.
This means that $\frac{(PA_d)^k}{d} \to_p \frac{1}{k+1}$ and $\frac{(PB_d)^k}{d} \to_p \frac{1}{k+1}$, and therefore

$\frac{PA_d}{d^{1/k}} \to_p \left(\frac{1}{k+1}\right)^{1/k}, \qquad \frac{PB_d}{d^{1/k}} \to_p \left(\frac{1}{k+1}\right)^{1/k}$   (1)
We intend to show that $\frac{|PA_d - PB_d|}{d^{1/k - 1/2}} \to_p \frac{1}{(k+1)^{1/k}} \sqrt{\frac{2}{2k+1}}$. We can express $|PA_d - PB_d|$ in the following numerator/denominator form, which we will use in order to examine the convergence behavior of the numerator and denominator individually:

$|PA_d - PB_d| = \frac{|(PA_d)^k - (PB_d)^k|}{\sum_{r=0}^{k-1} (PA_d)^{k-r-1} (PB_d)^r}$   (2)

Dividing both sides by $d^{1/k - 1/2}$ and regrouping the right-hand side we get:

$\frac{|PA_d - PB_d|}{d^{1/k - 1/2}} = \frac{|(PA_d)^k - (PB_d)^k| / \sqrt{d}}{\sum_{r=0}^{k-1} \left(\frac{PA_d}{d^{1/k}}\right)^{k-r-1} \left(\frac{PB_d}{d^{1/k}}\right)^r}$   (3)
Consequently, using Slutsky's theorem (footnote 3) and the results of Equation 1 we obtain

$\sum_{r=0}^{k-1} \left(\frac{PA_d}{d^{1/k}}\right)^{k-r-1} \left(\frac{PB_d}{d^{1/k}}\right)^r \to_p k \left(\frac{1}{k+1}\right)^{(k-1)/k}$   (4)

Having characterized the convergence behavior of the denominator of the right hand side of Equation 3, let us now examine the behavior of the numerator: $\frac{|(PA_d)^k - (PB_d)^k|}{\sqrt{d}} = \frac{|\sum_{i=1}^{d} ((P_i)^k - (Q_i)^k)|}{\sqrt{d}} = \frac{|\sum_{i=1}^{d} R_i|}{\sqrt{d}}$. Here $R_i$ is the new random variable defined by $((P_i)^k - (Q_i)^k)$ for all $i \in \{1, \ldots, d\}$. This random variable has zero mean and standard deviation $\sqrt{2}\,\sigma$, where $\sigma$ is the standard deviation of $(P_i)^k$. The sum of the different values of $R_i$ over $d$ dimensions will converge to a normal distribution with mean 0 and standard deviation $\sqrt{2}\,\sigma\,\sqrt{d}$ because of the central limit theorem. Consequently, the mean average deviation of this distribution will be $C\,\sigma\,\sqrt{d}$ for some constant $C$. Therefore, we have:

$\lim_{d \to \infty} E\left[\frac{|(PA_d)^k - (PB_d)^k|}{\sqrt{d}}\right] = C\,\frac{k}{k+1}\sqrt{\frac{1}{2k+1}}$   (5)

Footnote 2: This is because $E[P_i^k] = 1/(k+1)$ and $E[P_i^{2k}] = 1/(2k+1)$.
Footnote 3 (Slutsky's Theorem): Let $Y_1, \ldots, Y_d, \ldots$ be a sequence of random vectors and $h(\cdot)$ be a continuous function. If $Y_d \to_p c$, then $h(Y_d) \to_p h(c)$.
Since the denominator of Equation 3 shows probabilistic convergence, we can combine the results of Equations 4 and 5 to obtain

$\lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{1/k - 1/2}}\right] = C\,\frac{1}{(k+1)^{1/k}}\sqrt{\frac{1}{2k+1}}$   (6)
We can easily generalize the result for a database of
N
uniformly distributed
points. The following Corollary provides the result.
Corollary 1. Let $F$ be the uniform distribution of $N = n$ points. Then,
$\frac{C}{(k+1)^{1/k}}\sqrt{\frac{1}{2k+1}} \;\leq\; \lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}}\right] \;\leq\; \frac{C\,(n-1)}{(k+1)^{1/k}}\sqrt{\frac{1}{2k+1}}$.
Proof. This is because if $L$ is the expected difference between the maximum and minimum of two randomly drawn points, then the same value for $n$ points drawn from the same distribution must be in the range $(L, (n-1)\,L)$.
The results can be modied for arbitrary distributions of
N
points in a
database by introducing the constant factor
C
k
. In that case, the general de-
pendency of
D
max
?
D
min
on
d
1
k
?
1
2
remains unchanged. A detailed pro of is
provided in the Appendix; a short outline of the reasoning behind the result is
available in [9].
Lemma 2. [9] Let $F$ be an arbitrary distribution of $N = 2$ points. Then, $\lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}}\right] = C_k$, where $C_k$ is some constant dependent on $k$.
Corollary 2. Let $F$ be an arbitrary distribution of $N = n$ points. Then, $C_k \;\leq\; \lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}}\right] \;\leq\; (n-1)\,C_k$.
Thus, this result shows that in high dimensional space $Dmax_d^k - Dmin_d^k$ increases at the rate of $d^{1/k - 1/2}$, independent of the data distribution. This means that for the Manhattan distance metric, the value of this expression diverges to $\infty$; for the Euclidean distance metric, the expression is bounded by constants; whereas for all other distance metrics, it converges to 0 (see Figure 1). Furthermore, the convergence is faster when the value of $k$ of the $L_k$ metric increases.
[Fig. 1. $|Dmax - Dmin|$ depending on $d$ for different metrics (uniform data); panels: (a) $k = 3$, (b) $k = 2$, (c) $k = 1$, (d) $k = 2/3$, (e) $k = 2/5$.]
Table 2. Effect of dimensionality on relative ($L_1$ and $L_2$) behavior of the relative contrast

  Dimensionality   $P[U_d < T_d]$              Dimensionality   $P[U_d < T_d]$
  1                Both metrics are the same   10               95.6%
  2                85.0%                       15               96.1%
  3                88.7%                       20               97.1%
  4                91.3%                       100              98.2%
This provides the insight that higher norm parameters provide poorer contrast between the furthest and nearest neighbor. Even more insight may be obtained by examining the exact behavior of the relative contrast as opposed to the absolute distance between the furthest and nearest point.
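The $d^{1/k - 1/2}$ scaling above is easy to check numerically. The sketch below (our illustration, not from the paper) estimates $E[Dmax - Dmin]$ for uniform data at several dimensionalities and norm parameters; dividing by $d^{1/k - 1/2}$ should yield a roughly constant value for each $k$, while the raw difference diverges for $k = 1$, stays bounded for $k = 2$, and vanishes for $k \geq 3$.

```python
import numpy as np

def expected_dmax_minus_dmin(n_points, d, k, trials, rng):
    """Monte Carlo estimate of E[Dmax_d^k - Dmin_d^k] for uniform data, origin as query point."""
    diffs = []
    for _ in range(trials):
        X = rng.uniform(size=(n_points, d))
        dists = np.sum(X ** k, axis=1) ** (1.0 / k)
        diffs.append(dists.max() - dists.min())
    return float(np.mean(diffs))

rng = np.random.default_rng(7)
for k in (1, 2, 3):
    for d in (10, 100, 1000):
        diff = expected_dmax_minus_dmin(10, d, k, 200, rng)
        print(f"k={k}  d={d:5d}  E[Dmax-Dmin]={diff:8.4f}  "
              f"scaled by d^(1/k-1/2)={diff / d ** (1.0 / k - 0.5):8.4f}")
```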
Theorem 2. Let $F$ be the uniform distribution of $N = 2$ points. Then, $\lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{Dmin_d^k}\,\sqrt{d}\right] = C\,\sqrt{\frac{1}{2k+1}}$.

Proof. Let $A_d$, $B_d$, $P_1 \ldots P_d$, $Q_1 \ldots Q_d$, $PA_d$, $PB_d$ be defined as in the proof of Lemma 1. We have shown in the proof of the previous result that $\frac{PA_d}{d^{1/k}} \to_p \left(\frac{1}{k+1}\right)^{1/k}$. Using Slutsky's theorem we can derive that:

$\min\left\{\frac{PA_d}{d^{1/k}}, \frac{PB_d}{d^{1/k}}\right\} \to_p \left(\frac{1}{k+1}\right)^{1/k}$   (7)

We have also shown in the previous result that:

$\lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{1/k - 1/2}}\right] = C\,\frac{1}{(k+1)^{1/k}}\sqrt{\frac{1}{2k+1}}$   (8)

We can combine the results in Equations 7 and 8 to obtain:

$\lim_{d \to \infty} E\left[\frac{\sqrt{d}\;|PA_d - PB_d|}{\min\{PA_d, PB_d\}}\right] = C\,\sqrt{\frac{1}{2k+1}}$   (9)
[Fig. 2. Relative contrast variation with the norm parameter for the uniform distribution (curves for N = 100, N = 1,000, and N = 10,000; x-axis: parameter of the distance norm, 0 to 10; y-axis: relative contrast).]

[Fig. 3. Unit spheres for different fractional metrics in 2D (f = 1, 0.75, 0.5, 0.25).]
Note that the above results conrm of the results in [6] because it shows that
the relative contrast degrades as 1
=
p
d
for the dierent distance norms. Note
that for values of
d
in the reasonable range of data mining applications, the
norm dependent factor of
p
1
=
(2
k
+ 1) may play a valuable role in aecting
the relative contrast. For such cases, even the relative rate of degradation of
the dierent distance metrics for a given data set in the same value of the
dimensionality may be important. In the Figure 2 we have illustrated the relative
contrast created by an articially generated data set drawn from a uniform
distribution in
d
= 20 dimensions. Clearly, the relative contrast decreases with
increasing value of
k
and also follows the same trend as
p
1
=
(2
k
+ 1).
Another interesting aspect which can be explored to improve nearest neighbor and clustering algorithms in high-dimensional space is the effect of $k$ on the relative contrast. Even though the expected relative contrast always decreases with increasing dimensionality, this may not necessarily be true for a given data set and different $k$. To show this, we performed the following experiment on the Manhattan ($L_1$) and Euclidean ($L_2$) distance metrics: let $U_d = \frac{Dmax_d^2 - Dmin_d^2}{Dmin_d^2}$ and $T_d = \frac{Dmax_d^1 - Dmin_d^1}{Dmin_d^1}$. We performed some empirical tests to calculate the value of $P[U_d < T_d]$ for the case of the Manhattan ($L_1$) and Euclidean ($L_2$) distance metrics for $N = 10$ points drawn from a uniform distribution. In each trial, $U_d$ and $T_d$ were calculated from the same set of $N = 10$ points, and $P[U_d < T_d]$ was calculated by finding the fraction of times $U_d$ was less than $T_d$ in 1000 trials. The results of the experiment are given in Table 2. It is clear that with increasing dimensionality $d$, the value of $P[U_d < T_d]$ continues to increase. Thus, for higher dimensionality, the relative contrast provided by a norm with smaller parameter $k$ is more likely to dominate another with a larger parameter. For dimensionalities of 20 or higher, it is clear that the Manhattan distance metric provides a significantly higher relative contrast than the Euclidean distance metric with very high probability. Thus, among the distance metrics with integral norms, the Manhattan distance metric is the method of choice for providing the best contrast between the different points. This result of our analysis can be directly used in a number of different applications.
3 Fractional Distance Metrics
The result of the previous section, that the Manhattan metric ($k = 1$) provides the best discrimination in high-dimensional data spaces, is the motivation for looking into distance metrics with $k < 1$. We call these metrics fractional distance metrics. A fractional distance metric $dist_d^f$ ($L_f$ norm) for $f \in (0, 1)$ is defined as:

$dist_d^f(x, y) = \left[\sum_{i=1}^{d} (x_i - y_i)^f\right]^{1/f}.$
To give an intuition of the behavior of the fractional distance metric, we plotted in Figure 3 the unit spheres for different fractional metrics in $\mathbb{R}^2$.
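A plot along the lines of Figure 3 can be generated with a few lines of code. The sketch below (ours, using matplotlib) traces the curve $|x|^f + |y|^f = 1$ for several values of $f$, showing how the unit sphere becomes increasingly star-shaped as $f$ decreases.

```python
import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 2 * np.pi, 2000)
for f in (1.0, 0.75, 0.5, 0.25):
    c, s = np.cos(theta), np.sin(theta)
    # Radial scaling so that each point (r*c, r*s) satisfies |x|^f + |y|^f = 1.
    r = (np.abs(c) ** f + np.abs(s) ** f) ** (-1.0 / f)
    plt.plot(r * c, r * s, label=f"f = {f}")

plt.gca().set_aspect("equal")
plt.legend()
plt.title("Unit spheres of fractional L_f metrics (2D)")
plt.show()
```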
We will prove most of our results in this section assuming that $f$ is of the form $1/l$, where $l$ is some integer. The reason that we show the results for this special case is that we are able to use nice algebraic tricks for the proofs. The natural conjecture from the smooth continuous variation of $dist_d^f$ with $f$ is that the results are also true for arbitrary values of $f$ (footnote 4). Our results provide considerable insights into the behavior of the fractional distance metric and its relationship with the $L_k$-norm for integral values of $k$.
Lemma 3. Let $F$ be the uniform distribution of $N = 2$ points and $f = 1/l$ for some integer $l$. Then, $\lim_{d \to \infty} E\left[\frac{Dmax_d^f - Dmin_d^f}{d^{1/f - 1/2}}\right] = C\,\frac{1}{(f+1)^{1/f}}\sqrt{\frac{1}{2f+1}}$.

Proof. Let $A_d$, $B_d$, $P_1 \ldots P_d$, $Q_1 \ldots Q_d$, $PA_d$, $PB_d$ be defined using the $L_f$ metric as they were defined in Lemma 1 for the $L_k$ metric. Let further $QA_d = (PA_d)^f = (PA_d)^{1/l} = \sum_{i=1}^{d} (P_i)^f$ and $QB_d = (PB_d)^f = (PB_d)^{1/l} = \sum_{i=1}^{d} (Q_i)^f$. Analogous to Lemma 1, $\frac{QA_d}{d} \to_p \frac{1}{f+1}$ and $\frac{QB_d}{d} \to_p \frac{1}{f+1}$.
We intend to show that $\lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{l - 1/2}}\right] = C\,\frac{1}{(f+1)^{1/f}}\sqrt{\frac{1}{2f+1}}$. The difference of distances is $|PA_d - PB_d| = \left|\{\sum_{i=1}^{d} (P_i)^f\}^{1/f} - \{\sum_{i=1}^{d} (Q_i)^f\}^{1/f}\right| = \left|\{\sum_{i=1}^{d} (P_i)^f\}^{l} - \{\sum_{i=1}^{d} (Q_i)^f\}^{l}\right|$. Note that the above expression is of the form $|a^l - b^l| = |a - b| \cdot (\sum_{r=0}^{l-1} a^r b^{l-r-1})$. Therefore, $|PA_d - PB_d|$ can be written as $\{\sum_{i=1}^{d} |(P_i)^f - (Q_i)^f|\} \cdot \{\sum_{r=0}^{l-1} (QA_d)^r (QB_d)^{l-r-1}\}$. By dividing both sides by $d^{1/f - 1/2}$ and regrouping the right hand side we get:

$\frac{|PA_d - PB_d|}{d^{1/f - 1/2}} \to_p \left\{\frac{\sum_{i=1}^{d} |(P_i)^f - (Q_i)^f|}{\sqrt{d}}\right\} \cdot \left\{\sum_{r=0}^{l-1} \left(\frac{QA_d}{d}\right)^r \left(\frac{QB_d}{d}\right)^{l-r-1}\right\}$   (10)

By using the results in Equation 10, we can derive that:

$\frac{|PA_d - PB_d|}{d^{1/f - 1/2}} \to_p \left\{\frac{\sum_{i=1}^{d} |(P_i)^f - (Q_i)^f|}{\sqrt{d}}\right\} \cdot \left\{l \cdot \frac{1}{(1+f)^{l-1}}\right\}$   (11)

This random variable $(P_i)^f - (Q_i)^f$ has zero mean and standard deviation $\sqrt{2}\,\sigma$, where $\sigma$ is the standard deviation of $(P_i)^f$. The sum of the different values of $(P_i)^f - (Q_i)^f$ over $d$ dimensions will converge to a normal distribution with mean 0 and standard deviation $\sqrt{2}\,\sigma\,\sqrt{d}$ because of the central limit theorem. Consequently, the expected mean average deviation of this normal distribution is $C\,\sigma\,\sqrt{d}$ for some constant $C$. Therefore, we have:

$\lim_{d \to \infty} E\left[\frac{|(PA_d)^f - (PB_d)^f|}{\sqrt{d}}\right] = C\,\sigma = C\,\frac{f}{f+1}\sqrt{\frac{1}{2f+1}}.$   (12)

Combining the results of Equations 12 and 11, we get:

$\lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{1/f - 1/2}}\right] = \frac{C}{(f+1)^{1/f}}\sqrt{\frac{1}{2f+1}}$   (13)

Footnote 4: Empirical simulations of the relative contrast show this is indeed the case.

A direct consequence of the above result is the following generalization to $N = n$ points.
Corollary 3. Let $F$ be the uniform distribution of $N = n$ points and $f = 1/l$ for some integer $l$. Then, for some constant $C$ we have:
$\frac{C}{(f+1)^{1/f}}\sqrt{\frac{1}{2f+1}} \;\leq\; \lim_{d \to \infty} E\left[\frac{Dmax_d^f - Dmin_d^f}{d^{1/f - 1/2}}\right] \;\leq\; \frac{C\,(n-1)}{(f+1)^{1/f}}\sqrt{\frac{1}{2f+1}}$.
Proof.
Similar to Corollary 1.
The above result shows that the absolute dierence between the maximum
and minimum for the fractional distance metric increases at the rate of
d
1
=f
?
1
=
2
.
Thus, the smaller the fraction, the greater the rate of absolute divergence be-
tween the maximum and minimum value. Now, we will examine the relative
contrast of the fractional distance metric.
Theorem 3. Let $F$ be the uniform distribution of $N = 2$ points and $f = 1/l$ for some integer $l$. Then, $\lim_{d \to \infty} E\left[\frac{Dmax_d^f - Dmin_d^f}{Dmin_d^f}\,\sqrt{d}\right] = C\,\sqrt{\frac{1}{2f+1}}$ for some constant $C$.

Proof. Analogous to the proof of Theorem 2.

The following is the direct generalization to $N = n$ points.
Corollary 4. Let $F$ be the uniform distribution of $N = n$ points, and $f = 1/l$ for some integer $l$. Then, for some constant $C$,
$C\,\sqrt{\frac{1}{2f+1}} \;\leq\; \lim_{d \to \infty} E\left[\frac{Dmax_d^f - Dmin_d^f}{Dmin_d^f}\,\sqrt{d}\right] \;\leq\; C\,(n-1)\,\sqrt{\frac{1}{2f+1}}$.

Proof. Analogous to the proof of Corollary 1.
This result is true for the case of arbitrary values of $f$ (not just $f = 1/l$) and $N$, but the use of these specific values of $f$ helps considerably in simplifying the proof of the result. The empirical simulation in Figure 2 shows the behavior for arbitrary values of $f$ and $N$. The curve for each value of $N$ is different, but all curves fit the general trend of reduced contrast with increased value of $f$. Note that the value of the relative contrast for both the integral distance metric $L_k$ and the fractional distance metric $L_f$ is the same in the boundary case when $f = k = 1$.
The above results show that fractional distance metrics provide better con-
trast than integral distance metrics both in terms of the absolute distributions of
points to a given query point and relative distances. This is a surprising result in
light of the fact that the Euclidean distance metric is traditionally used in a large
variety of indexing structures and data mining applications. The widespread use
of the Euclidean distance metric stems from the natural extension of applicabil-
ity to spatial database systems (many multidimensional indexing structures were
initially proposed in the context of spatial systems). However, from the perspec-
tive of high dimensional data mining applications, this natural interpretability
in 2 or 3-dimensional spatial systems is completely irrelevant. Whether the the-
oretical behavior of the relative contrast also translates into practically useful
implications for high dimensional data mining applications is an issue which we
will examine in greater detail in the next section.
4 Empirical Results
In this section, we show that our surprising ndings can be directly applied to
improve existing mining techniques for high-dimensional data. For the experi-
ments, we use synthetic and real data. The synthetic data consists of a number
of clusters (data inside the clusters follow a normal distribution and the cluster
centers are uniformly distributed). The advantage of the synthetic data sets is
that the clusters are clearly separated and any clustering algorithm should be
able to identify them correctly. For our experiments we used one of the most
widely used standard clustering algorithms - the
k-means algorithm
. The data
set used in the experiments consists of 6 clusters with 10000 data p oints each and
no noise. The dimensionality was chosen to be 20. The results of our experiments
show that the fractional distance metrics provides a much higher classication
rate which is about 99% for the fractional distance metric with
f
= 0
:
3 versus
89% for the Euclidean metric (see gure 4). The detailed results including the
confusion matrices obtained are provided in the appendix.
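The paper does not spell out how the distance metric enters k-means, so the sketch below is only one plausible reading (our assumption): the assignment step uses a fractional $L_f$ distance while the centroid update remains the ordinary mean. Note that the mean is the exact minimizer only for squared Euclidean distance, so this variant is a heuristic rather than a strict optimizer of an $L_f$ objective.

```python
import numpy as np

def lp_dist(X, C, p):
    """Pairwise (sum |x - c|^p)^(1/p) distances between rows of X and rows of C."""
    return np.sum(np.abs(X[:, None, :] - C[None, :, :]) ** p, axis=2) ** (1.0 / p)

def kmeans_lp(X, n_clusters, p=0.3, n_iter=100, seed=0):
    """k-means-style clustering with an L_p assignment step (p may be fractional)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        new_labels = lp_dist(X, centers, p).argmin(axis=1)    # assign by fractional distance
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for j in range(n_clusters):                           # heuristic centroid update: plain mean
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```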
For the experiments with real data sets, we use some of the classication
problems from the UCI machine learning repository
5
. All of these problems
are classication problems which have a large number of feature variables, and
a special variable which is designated as the class label. We used the following
5
http
:
==www:cs:uci:edu=
~
mlearn
50
55
60
65
70
75
80
85
90
95
100
0 0.5 1 1.5 2 2.5 3
Classification Rate
Distance Parameter
Fig. 4.
Eectiveness of k-Means
simple experiment: For each of the cases that we tested on, we stripped off the class variable from the data set and considered the feature variables only. The query points were picked from the original database, and the closest $l$ neighbors were found to each target point using different distance metrics. The technique was tested using the following two measures:
1. Class Variable Accuracy: This was the primary measure that we used in order to test the quality of the different distance metrics. Since the class variable is known to depend in some way on the feature variables, the proximity of objects belonging to the same class in feature space is evidence of the meaningfulness of a given distance metric. The specific measure that we used was the total number of the $l$ nearest neighbors that belonged to the same class as the target object, summed over all the different target objects. Needless to say, we do not intend to propose this rudimentary unsupervised technique as an alternative to classification models, but use the classification performance only as evidence of the meaningfulness (or lack of meaningfulness) of a given distance metric. The class labels may not necessarily always correspond to locality in feature space; therefore the meaningfulness results presented are evidential in nature. However, a consistent effect on the class variable accuracy with increasing norm parameter does tend to be a powerful way of demonstrating qualitative trends. (A sketch of this evaluation, combined with the noise masking described next, follows this list.)
2. Noise Stability: How does the quality of the distance metric vary with more or less noisy data? We used noise masking in order to evaluate this aspect. In noise masking, each entry in the database was replaced by a random entry with masking probability $p_c$. The random entry was chosen from a uniform distribution centered at the mean of that attribute. Thus, when $p_c$ is 1, the data is completely noisy. We studied how each of the two problems was affected by noise masking.
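The following sketch (our reconstruction under stated assumptions, not the authors' code) illustrates both measures at once: it applies noise masking with probability p_c, then counts how many of the l nearest neighbors of each query point share its class label under a chosen $L_p$ metric. The width of the uniform noise, the synthetic stand-in data, and the choice of l are all placeholders.

```python
import numpy as np

def noise_mask(X, p_c, rng):
    """Replace each entry with a random value with probability p_c.
    The noise is uniform and centered at the attribute mean; its width (attribute range) is an assumption."""
    means, spans = X.mean(axis=0), X.max(axis=0) - X.min(axis=0)
    noise = rng.uniform(means - spans / 2, means + spans / 2, size=X.shape)
    mask = rng.uniform(size=X.shape) < p_c
    return np.where(mask, noise, X)

def class_variable_accuracy(X, labels, p, l, p_c=0.0, seed=0):
    """Total count of l-nearest-neighbor class matches over all query points, under the L_p metric."""
    rng = np.random.default_rng(seed)
    Xn = noise_mask(X, p_c, rng)
    matches = 0
    for i in range(len(Xn)):
        d = np.sum(np.abs(Xn - Xn[i]) ** p, axis=1) ** (1.0 / p)
        d[i] = np.inf                                  # exclude the query point itself
        nn = np.argsort(d)[:l]
        matches += int(np.sum(labels[nn] == labels[i]))
    return matches

# Hypothetical usage with a synthetic stand-in for a UCI data set:
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 30))
labels = rng.integers(0, 3, size=300)
for p in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(p, class_variable_accuracy(X, labels, p, l=5, p_c=0.2))
```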
In Table 3, we have illustrated some examples of the variation in performance for different distance metrics. With a few exceptions, the major trend in this table is that the accuracy performance decreases with increasing value of the norm parameter. We have shown the table in the range $L_{0.1}$ to $L_{10}$ because it was easiest to calculate the distance values without exceeding the numerical ranges in the computer representation. We have also illustrated the accuracy performance when the $L_\infty$ metric is used.
Table 3. Number of correct class label matches between nearest neighbor and target

  Data Set               $L_{0.1}$  $L_{0.5}$  $L_1$  $L_2$  $L_4$  $L_{10}$  $L_\infty$  Random
  Machine                  522        474       449    402    364     353       341        153
  Musk                     998        893       683    405    301     272       163        140
  Breast Cancer (wdbc)    5299       5268      5196   5052   4661    4172      4032       3021
  Segmentation            1423       1471      1377   1210   1103    1031       300        323
  Ionosphere              2954       3002      2839   2430   2062    1836      1769       1884
[Fig. 5. Accuracy depending on the norm parameter (x-axis: parameter of the distance norm used, 0 to 10; y-axis: accuracy ratio to random matching).]
[Fig. 6. Accuracy depending on noise masking (curves for L(0.1), L(1), and L(10); x-axis: noise masking probability, 0 to 1; y-axis: accuracy ratio to random matching).]
One interesting observation is that the accuracy with the $L_\infty$ distance metric is often worse than the accuracy obtained by picking a record from the database at random and reporting the corresponding target value. This trend is observed because the $L_\infty$ metric only looks at the dimension in which the target and neighbor are furthest apart. In high dimensional space, this is likely to be a very poor representation of the nearest neighbor. A similar argument is true for $L_k$ distance metrics (for high values of $k$), which give undue importance to the distant (sparse/noisy) dimensions. It is precisely this aspect which is reflected in our theoretical analysis of the relative contrast, and which causes distance metrics with high norm parameters to discriminate poorly between the furthest and nearest neighbor.
In Figure 5, we have shown the variation in the accuracy of the class variable matching with $k$, when the $L_k$ norm is used. The accuracy on the Y-axis is reported as the ratio of the accuracy to that of a completely random matching scheme. The graph is averaged over all the data sets of Table 3. It is easy to see that there is a clear trend of the accuracy worsening with increasing values of the parameter $k$.
We also studied the robustness of the scheme to the use of noise masking. For this purpose, we have illustrated the performance of three distance metrics in Figure 6: $L_{0.1}$, $L_1$, and $L_{10}$, for various values of the masking probability on the machine data set. On the X-axis, we have denoted the value of the masking probability, whereas on the Y-axis we have the accuracy ratio to that of a completely random matching scheme. Note that when the masking probability is 1, any scheme would degrade to a random method. However, it is interesting to see from Figure 6 that the $L_{10}$ distance metric degrades much faster to the random performance (at a masking probability of 0.4), whereas $L_1$ degrades to random at 0.6. The $L_{0.1}$ distance metric is most robust to the presence of noise in the data set and degrades to random performance at the slowest rate. These results are closely connected to our theoretical analysis, which shows the rapid lack of discrimination between the nearest and furthest distances for high values of the norm parameter because of the undue weighting given to the noisy dimensions, which contribute the most to the distance.
5 Conclusions and Summary
In this paper, we showed some surprising results on the qualitative behavior of the different distance metrics for measuring proximity in high dimensionality. We demonstrated our results in both a theoretical and an empirical setting. In the past, not much attention has been paid to the choice of distance metrics used in high dimensional applications. The results of this paper are likely to have a powerful impact on the particular choice of distance metric for problems such as clustering, categorization, and similarity search, all of which depend upon some notion of proximity.
References
1. Weber R., Schek H.-J., Blott S.: A Quantitative Analysis and Performance Study
for Similarity-Search Methods in High-Dimensional Spaces.
VLDB Conference Pro-
ceedings
, 1998.
2. Bennett K. P., Fayyad U., Geiger D.: Density-Based Indexing for Approximate
Nearest Neighbor Queries.
ACM SIGKDD Conference Proceedings
, 1999.
3. Berchtold S., Bohm C., Kriegel H.-P.: The Pyramid Technique: Towards Breaking
the Curse of Dimensionality.
ACM SIGMOD Conference Proceedings
, June 1998.
4. Berchtold S., Bohm C., Keim D., Kriegel H.-P.: A Cost Model for Nearest Neighbor
Search in High Dimensional Space.
ACM PODS Conference Proceedings
, 1997.
5. Berchtold S., Ertl B., Keim D., Kriegel H.-P. Seidl T.: Fast Nearest Neighbor Search
in High Dimensional Spaces.
ICDE Conference Proceedings
, 1998.
6. Beyer K., Goldstein J., Ramakrishnan R., Shaft U.: When is Nearest Neighbors
Meaningful?
ICDT Conference Proceedings
, 1999.
7. Shaft U., Goldstein J., Beyer K.: Nearest Neighbor Query Performance for Unsta-
ble Distributions. Technical Report TR 1388, Department of Computer Science,
University of Wisconsin at Madison.
8. Guttman, A.: R-Trees: A Dynamic Index Structure for Spatial Searching.
ACM
SIGMOD Conference Proceedings
, 1984.
9. Hinneburg A., Aggarwal C., Keim D.: What is the nearest neighbor in high dimen-
sional spaces?
VLDB Conference Proceedings
, 2000.
10. Katayama N., Satoh S.: The SR-Tree: An Index Structure for High Dimensional
Nearest Neighbor Queries.
ACM SIGMOD Conference Proceedings
, 1997.
11. Lin K.-I., Jagadish H. V., Faloutsos C.: The TV-tree: An Index Structure for High
Dimensional Data.
VLDB Journal
, Volume 3, Number 4, pages 517-542, 1992.
Appendix
Here we provide a detailed proof of Lemma 2, which proves our modied conver-
gence results for arbitrary distributions of points. This Lemma shows that the
asymptotical rate of convergence of the absolute dierence of distances between
the nearest and furthest points is dependent on the distance norm used. To re-
cap, we restate Lemma 2.
Lemma 2: Let $F$ be an arbitrary distribution of $N = 2$ points. Then, $\lim_{d \to \infty} E\left[\frac{Dmax_d^k - Dmin_d^k}{d^{1/k - 1/2}}\right] = C_k$, where $C_k$ is some constant dependent on $k$.

Proof. Let $A_d$ and $B_d$ be the two points in a $d$-dimensional data distribution such that each coordinate is independently drawn from the data distribution $F$. Specifically $A_d = (P_1 \ldots P_d)$ and $B_d = (Q_1 \ldots Q_d)$ with $P_i$ and $Q_i$ being drawn from $F$. Let $PA_d = \{\sum_{i=1}^{d} (P_i)^k\}^{1/k}$ be the distance of $A_d$ to the origin using the $L_k$ metric and $PB_d = \{\sum_{i=1}^{d} (Q_i)^k\}^{1/k}$ the distance of $B_d$.
We assume that the $k$-th power of a random variable drawn from the distribution $F$ has mean $\mu_{F,k}$ and standard deviation $\sigma_{F,k}$. This means that $\frac{(PA_d)^k}{d} \to_p \mu_{F,k}$ and $\frac{(PB_d)^k}{d} \to_p \mu_{F,k}$, and therefore:

$\frac{PA_d}{d^{1/k}} \to_p (\mu_{F,k})^{1/k}, \qquad \frac{PB_d}{d^{1/k}} \to_p (\mu_{F,k})^{1/k}.$   (14)

We intend to show that $\frac{|PA_d - PB_d|}{d^{1/k - 1/2}} \to_p C_k$ for some constant $C_k$ depending on $k$. We express $|PA_d - PB_d|$ in the following numerator/denominator form, which we will use in order to examine the convergence behavior of the numerator and denominator individually:

$|PA_d - PB_d| = \frac{|(PA_d)^k - (PB_d)^k|}{\sum_{r=0}^{k-1} (PA_d)^{k-r-1} (PB_d)^r}$   (15)

Dividing both sides by $d^{1/k - 1/2}$ and regrouping the right-hand side we get

$\frac{|PA_d - PB_d|}{d^{1/k - 1/2}} = \frac{|(PA_d)^k - (PB_d)^k| / \sqrt{d}}{\sum_{r=0}^{k-1} \left(\frac{PA_d}{d^{1/k}}\right)^{k-r-1} \left(\frac{PB_d}{d^{1/k}}\right)^r}$   (16)

Consequently, using Slutsky's theorem and the results of Equation 14 we have:

$\sum_{r=0}^{k-1} \left(\frac{PA_d}{d^{1/k}}\right)^{k-r-1} \left(\frac{PB_d}{d^{1/k}}\right)^r \to_p k\,(\mu_{F,k})^{(k-1)/k}$   (17)

Having characterized the convergence behavior of the denominator of the right-hand side of Equation 16, let us now examine the behavior of the numerator: $\frac{|(PA_d)^k - (PB_d)^k|}{\sqrt{d}} = \frac{|\sum_{i=1}^{d} ((P_i)^k - (Q_i)^k)|}{\sqrt{d}} = \frac{|\sum_{i=1}^{d} R_i|}{\sqrt{d}}$. Here $R_i$ is the new random variable defined by $((P_i)^k - (Q_i)^k)$ for all $i \in \{1, \ldots, d\}$.
This random variable has zero mean and standard deviation $\sqrt{2}\,\sigma_{F,k}$, where $\sigma_{F,k}$ is the standard deviation of $(P_i)^k$. Then, the sum of the different values of $R_i$ over $d$ dimensions will converge to a normal distribution with mean 0 and standard deviation $\sqrt{2}\,\sigma_{F,k}\,\sqrt{d}$ because of the central limit theorem. Consequently, the mean average deviation of this distribution will be $C\,\sigma_{F,k}\,\sqrt{d}$ for some constant $C$. Therefore, we have:

$\lim_{d \to \infty} E\left[\frac{|(PA_d)^k - (PB_d)^k|}{\sqrt{d}}\right] = C\,\sigma_{F,k}$   (18)

Since the denominator of Equation 16 shows probabilistic convergence, we can combine the results of Equations 17 and 18 to obtain:

$\lim_{d \to \infty} E\left[\frac{|PA_d - PB_d|}{d^{1/k - 1/2}}\right] = \frac{C\,\sigma_{F,k}}{k\,(\mu_{F,k})^{(k-1)/k}}$   (19)
The result follows.
Confusion Matrices. We have illustrated the confusion matrices for two different values of $p$ below. As illustrated, the confusion matrix obtained using the value $p = 0.3$ is significantly better than the one obtained using $p = 2$.
Table 4. Confusion Matrix, p = 2 (rows: prototypes, columns: clusters)

  1208    82  9711     4    10    14
     0     2     0     0  6328     4
     1  9872   104    32    11     0
  8750     8    74  9954     1    18
    39     0    10     8     8  9948
     2    36   101     2  3642    16
Table 5. Confusion Matrix, p = 0.3 (rows: prototypes, columns: clusters)

    51   115  9773    10    37    15
     0    17    24     0  9935    14
    15    10     9  9962     0     4
     1  9858    66     5    19     1
     8     0     9     3     9  9956
  9925     0   119    20     0    10