Extending a Distance Metric Learning Approach to cover Non-Geometric Spaces
Aparna S. Varde1 and Jianyu Liang2
1. Department of Computer Science. Montclair State University, NJ, USA, vardea@montclair.edu
2. Department of Mechanical Engineering, Worcester Polytechnic Institute, MA, USA, jianyul@wpi.edu
Abstract
In this paper we address the problem of learning a notion of
similarity for complex data in scientific domains. In
particular we focus on images in nanotechnology. When
scientists compare such data, they often have subtle criteria
that cannot be represented in vector space, such as the
presence, absence or extent of domain-specific occurrences
attributed to the results of experiments. In addition, there are
measurable criteria in terms of visual features. While a
human expert can intuitively combine these criteria to detect
the similarity during data analysis, it is challenging to capture
this notion of similarity computationally. We propose to learn
a similarity (or distance) function that can be used for
computational analysis of complex scientific data such as
nanoscale images. We investigate the use of a distance metric
learning approach to learn a non-(geo)metric distance as a
notion of similarity. We consider as an example our earlier
approach called LearnMet that was proposed to learn metric
distances for graphical plots in a domain-specific context. We
discuss the issues involved in extending a metric learning
approach to cover non-metric distances. Although, we deal
mainly with nanoscale images in this paper, the logic used
here can be applied to other non-geometric spaces.
1. Background and Motivation
Nanotechnology is a growing field today with numerous applications.
For example, studying images of material substrates at a nanoscale level
helps in understanding the impact of physical and chemical treatments
on them under various conditions. This in turn contributes to
applications such as developing material implants for the human body.
In order to observe the differences between various such nanostructural
images and draw conclusions for corresponding applications, it is
essential to capture a notion of similarity for the images. We present
examples of such images used in our work in Figure 1.
Figure 1: Sample Images in Nanotechnology
This figure depicts images taken with a Scanning Electron Microscope
(SEM) of Anodized Aluminum Oxide (AAO) templates after partial
etching with 10% phosphoric acid (H3PO4) for around 3 hours.
—————
Submitted to ICML 2010 Workshops Program
—————
Image (a) shows untreated AAO templates after 5 avidin filtration steps. Image (b)
shows AAO templates pre-treated with propylphosphonic acid and
glutaraldehyde after 5 avidin filtration steps. Image (c) shows AAO
templates pre-treated with polyethylenimine after 5 avidin filtration
steps [1]. The nature of these images conveys the experimental results
that are important for analysis by scientists. As seen from these images, it is hard
to define similarity only in terms of visual features such as particle size
and inter-particle distance. However, a domain expert would be able to
group these images as being similar or different with respect to other
images, thus providing training data in the form of clusters. This can
serve as the basis to learn a non-metric distance for computational
analysis. In order to learn such a distance, we consider the use of an
approach called LearnMet [3] proposed earlier for learning metric
distances over graphical plots in scientific domains. We briefly
overview this approach next.
2. Distance Metric Learning Approach
In the LearnMet approach to learn metric distances for graphical plots,
the distance D(A, B) between a pair of plots A and B is defined as the
weighted sum of individual distance metric components. Thus, D(A, B)
= w1D1(A, B) + … + wmDm(A, B), where D1 … Dm are the individual metrics
that apply to the plots and w1… wm are the weights depicting the relative
importance of those metrics. For example, these individual components
could be Euclidean distance between the points on the plots, statistical
distances pertaining to their mean, median or mode values and domain-
specific metrics (e.g., the Leidenfrost distance in Materials Science [3]).
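For illustration, the weighted-sum distance can be sketched as follows. This is a minimal sketch, not the authors' implementation; the two component metrics shown (Euclidean distance and a mean-based statistical distance) are just example components, as the actual components are domain-specific.

```python
import math

def euclidean(a, b):
    # Example component: Euclidean distance between corresponding points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_distance(a, b):
    # Example component: statistical distance between mean values.
    return abs(sum(a) / len(a) - sum(b) / len(b))

def weighted_distance(a, b, components, weights):
    # D(A, B) = w1*D1(A, B) + ... + wm*Dm(A, B)
    return sum(w * d(a, b) for w, d in zip(weights, components))

A, B = [1.0, 2.0, 3.0], [2.0, 2.0, 4.0]
D = weighted_distance(A, B, [euclidean, mean_distance], [0.7, 0.3])
```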
The input to LearnMet is a training set with actual or true clusters of
graphical plots provided by domain experts. This serves to objectively
specify the notion of correctness in the domain as subjectively reasoned
by the domain experts.
The steps of LearnMet are: (1) guess an initial metric D as a weighted
sum of metrics applicable to the domain; (2) use that metric D for
clustering with an arbitrary but fixed clustering algorithm to get
predicted clusters; (3) evaluate clustering accuracy by comparing
predicted and actual clusters to obtain the error between them; (4) adjust
the metric D based on the error, and re-execute clustering and evaluation
until error is minimal or below a threshold; (5) output the metric D
giving lowest error as the learned metric.
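The five steps above can be sketched as an epoch loop. This is an illustrative skeleton only; cluster_fn, error_fn and adjust_fn are hypothetical stand-ins for the clustering, evaluation and weight adjustment steps, and the threshold and epoch limit are assumed values.

```python
def learnmet(initial_weights, components, actual_clusters,
             cluster_fn, error_fn, adjust_fn,
             threshold=0.05, max_epochs=100):
    weights = list(initial_weights)                # step 1: initial guess
    best_weights, best_error = weights, float("inf")
    for _ in range(max_epochs):
        predicted = cluster_fn(weights, components)         # step 2: cluster
        error = error_fn(predicted, actual_clusters)        # step 3: evaluate
        if error < best_error:
            best_weights, best_error = list(weights), error
        if error <= threshold:
            break
        weights = adjust_fn(weights, predicted, actual_clusters)  # step 4: adjust
    return best_weights, best_error                # step 5: lowest-error metric
```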
A crucial aspect of LearnMet is the manner in which the clustering
accuracy is evaluated and the metric is adjusted based on the error
between predicted and actual clusters [3]. We summarize this
considering a pair of graphical plots A and B.
(A, B) is a True Positive (TP) pair if A and B are in the same predicted
cluster and in the same actual cluster.
(A, B) is a True Negative (TN) pair if A and B are in different predicted
clusters and in different actual clusters.
(A, B) is a False Positive (FP) pair if A and B are in the same predicted
cluster but in different actual clusters.
(A,B) is a False Negative (FN) pair if A and B are in different predicted
clusters but in the same actual cluster.
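These four pair categories can be counted directly from cluster assignments. The sketch below assumes predicted and actual clusterings are given as dictionaries mapping each object to a cluster label; this representation is an assumption for illustration.

```python
from itertools import combinations

def pair_confusion(predicted, actual):
    # predicted, actual: dicts mapping each object to its cluster label.
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_true = actual[a] == actual[b]
        if same_pred and same_true:
            tp += 1          # True Positive pair
        elif not same_pred and not same_true:
            tn += 1          # True Negative pair
        elif same_pred and not same_true:
            fp += 1          # False Positive pair
        else:
            fn += 1          # False Negative pair
    return tp, tn, fp, fn
```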
Error in any given epoch is defined as (FP+FN) / (TP+TN+FP+FN)
where TP, TN, FP and FN are the number of true positives, true
negatives, false positives and false negatives respectively [3]. If the error
is above a given threshold, then the distance metric needs to be adjusted
because the clustering is performed with respect to that distance metric.
For this adjustment, distances DFN and DFP are defined as the average
distance using the metric D of the false negative pairs and of the false
positive pairs respectively [3]. These are calculated as follows:
DFN = (1/FN) Σ [n=1 to FN] D(A, B), where (A, B) is each FN pair.
DFP = (1/FP) Σ [n=1 to FP] D(A, B), where (A, B) is each FP pair.
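As a minimal sketch (not the authors' code), the error and the average pair distances can be computed as follows; the distance argument stands in for the current learned metric D.

```python
def clustering_error(tp, tn, fp, fn):
    # Error = (FP + FN) / (TP + TN + FP + FN)
    total = tp + tn + fp + fn
    return (fp + fn) / total if total else 0.0

def mean_pair_distance(pairs, distance):
    # DFN or DFP: the average distance D(A, B) over the FN or FP pairs.
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)
```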
Weights in the distance metric are then adjusted using heuristics that
work differently for the FP and FN pairs. The argument is that to reduce
the error caused due to false negative pairs, it is desirable to reduce the
distance DFN, to increase the likelihood of these pairs being correctly
placed in the same cluster in the next epoch. This is done by reducing
the weights of one or more components in the distance metric in
proportion to the fraction of the error caused by that component. Thus,
for each component Di, its new weight due to FN adjustment is wi′ = wi
− DFNi / DFN. This is termed the FN heuristic [3]. Conversely, to
reduce the error due to false positive pairs, it is desirable to increase the
distance DFP, so that they would likely be in different clusters in the
next epoch as required. This is done by increasing the weights of one or
more components proportionately. Thus, for each component Di, its new
weight due to FP adjustment is wi″ = wi + DFPi / DFP. This is known
as the FP heuristic [3]. Combining these formulae and clamping negative
weights to zero for simplicity, we get a weight adjustment heuristic [3] as
follows. For each component Di, its new weight is wi‴ = max(0, wi −
DFNi / DFN + DFPi / DFP).
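The combined heuristic can be sketched as below. This is an illustrative rendering of the formula wi‴ = max(0, wi − DFNi/DFN + DFPi/DFP); the per-component contributions DFNi and DFPi are passed in as assumed inputs.

```python
def adjust_weights(weights, dfn_components, dfn_total, dfp_components, dfp_total):
    # Combined heuristic: w_i''' = max(0, w_i - DFN_i/DFN + DFP_i/DFP),
    # where DFN_i (DFP_i) is component i's contribution to DFN (DFP).
    new_weights = []
    for w, dfn_i, dfp_i in zip(weights, dfn_components, dfp_components):
        delta_fn = dfn_i / dfn_total if dfn_total else 0.0
        delta_fp = dfp_i / dfp_total if dfp_total else 0.0
        new_weights.append(max(0.0, w - delta_fn + delta_fp))
    return new_weights
```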
This weight adjustment continues until the error drops below
threshold or a maximum number of epochs is reached. Finally, the
metric giving lowest error is output as the learned distance metric.
3. Extension to Non-Geometric Spaces
Since the LearnMet approach has been evaluated extensively with
real scientific data and has been found to yield good results, we
consider extending it to non-geometric spaces. This leads to general
arguments about how metric learning approaches can potentially be
reused for non-metrics.
Metrics vs. Non-Metrics: A distance is a metric if it satisfies all of the
following properties [2]: (i) It is non-negative. (ii) The distance of an object
to itself is zero. (iii) It is symmetric, i.e., Distance(P, Q) =
Distance(Q, P) for any objects P, Q in n-dimensional space. (iv) It
satisfies the triangle inequality: for any three objects P, Q and R in
n-dimensional space, Distance(P, Q) + Distance(Q, R) ≥ Distance(P, R).
If any of these properties is violated, the distance is a non-metric.
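For illustration, these properties can be checked empirically on a finite sample of objects; a single violation proves a distance is a non-metric. This sketch and its sample points are ours, not part of the LearnMet approach.

```python
def is_metric_on_sample(distance, points, tol=1e-9):
    # Empirically check the four metric properties on a finite sample.
    for p in points:
        if abs(distance(p, p)) > tol:            # (ii) self-distance is zero
            return False
        for q in points:
            d = distance(p, q)
            if d < -tol:                          # (i) non-negativity
                return False
            if abs(d - distance(q, p)) > tol:     # (iii) symmetry
                return False
            for r in points:
                # (iv) triangle inequality
                if distance(p, q) + distance(q, r) < distance(p, r) - tol:
                    return False
    return True

# Absolute difference is a metric; squared difference violates the
# triangle inequality (1 + 1 < 4 for the points 0, 1, 2), so it is a non-metric.
```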
Adaptation of Learning: Given the steps of a distance metric learning
approach such as LearnMet, the constraints for the distance to be a
metric are important mainly in the basic distance formula D(A, B) =
w1D1(A,B) … + wmDm(A, B) and consequently in formulae involving
calculation of distances DFN and DFP. Hence, if appropriate formulae
can be objectively defined for non-metrics, then the remaining steps can
be adapted in the learning process. This includes details such as
definitions of TP, TN, FP and FN pairs and formulae for the FP, FN and
weight adjustment heuristics with reference to context. Likewise, we argue that
in other learning techniques, similar adaptation can be performed. Using
this logic, we propound the following general hypothesis.
Hypothesis: A distance metric learning approach can be extended to
cover non-(geo)metric spaces provided the non-metric distance(s) can be
objectively formulated, preserving the constraints in the approach.
Challenges and Solutions: It is a challenge to objectively formulate
the individual components that constitute the non-metric distance. For
example, in nanotechnology, a distance capturing the fact that a material
has been significantly etched is hard to define objectively. As a
solution, we propose an approximate mapping in terms of levels of
etching that can be captured by ordinal distances. This is analogous to
gold, silver and bronze being 1, 2 and 3 respectively. Accordingly, we
propose other subjective to objective mappings that are domain-specific
and can preserve the intuitive reasoning of the human experts. A detailed
explanation of these is beyond the scope of this paper. We claim that
similar objective distance components can be defined in non-geometric
spaces in other domains and the adaptation of learning can thus be
performed. Another significant challenge is to obtain all the individual
components for learning. We address this by deploying a
technique we have proposed for component selection in complex data
sets that considers the use of greedy and exhaustive searches and their
hybrid combinations for distance function learning [4].
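The proposed ordinal mapping can be sketched as follows. The level names and ranks here are hypothetical examples of such a subjective-to-objective mapping, analogous to gold, silver and bronze being 1, 2 and 3.

```python
# Hypothetical ordinal scale for a subjective criterion: extent of etching.
ETCHING_LEVEL = {"none": 0, "partial": 1, "significant": 2, "complete": 3}

def etching_distance(level_a, level_b):
    # Ordinal distance component: absolute difference of ranks.
    return abs(ETCHING_LEVEL[level_a] - ETCHING_LEVEL[level_b])
```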
Computational Complexity: The complexity of the algorithm in
LearnMet in the best case is O(m) where m is the number of components
in the distance metric and the components are selected using a greedy
search [4]. In the worst case, where the components are selected using an
exhaustive search, the complexity is O(2^m − 1) [4]. Somewhat similar
complexity can be obtained for non-metric distances if the components
are selected using greedy and exhaustive searches respectively, because
the manner in which the complexity is calculated does not depend on the
distance being a metric [4]. This argument would likely hold for
any learning algorithm that follows an epoch-based approach.
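The 2^m − 1 figure follows from enumerating every non-empty subset of the m components, as the short sketch below shows; it is our illustration of the count, not the selection technique of [4].

```python
from itertools import combinations

def candidate_subsets(components):
    # Exhaustive search evaluates every non-empty subset of the m
    # components: 2^m - 1 candidates in total.
    subsets = []
    for k in range(1, len(components) + 1):
        subsets.extend(combinations(components, k))
    return subsets
```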
Advantages: The learning adaptation discussed here provides
extensibility of an existing learning technique based on its overall
methodology, in this case, particularly the approach of using true
clusters to objectively indicate the notion of correctness in the domain
by capturing the subjective reasoning of experts. Moreover, it allows the
plugging in of similar formulae within the internal details of the
technique, in this case, for error calculations and weight adjustment
heuristics, thus not requiring new derivations. Furthermore, it promotes
code reuse, which is highly desirable in software engineering. It also largely
preserves the computational complexity of the learning algorithm.
4. Conclusions
This paper discusses the extension of distance metric learning
approaches to non-metric spaces. We have considered the adaptation of
learning, the challenges involved, computational complexity and
advantages. This work would appeal to the machine learning and
scientific data mining communities. It could be enhanced by potential
collaboration with other researchers.
References
[1] Dougherty, S. and Liang, J. Fabrication of Segmented Nanofibers by
Template Wetting of Multilayered Alternating Polymer Thin Films,
Journal of Nanoparticle Research, 11:743 (2009).
[2] Han, J. and Kamber, M. Data Mining: Concepts and Techniques,
Morgan Kaufmann (2001).
[3] Varde, A., Rundensteiner, E., Ruiz, C., Maniruzzaman, M. and Sisson
Jr., R. Learning Domain-Specific Distance Metrics for Plots of Scientific
Functions, Journal of Multimedia Tools and Applications, Springer,
35:29-53 (2007).
[4] Varde, A., Bique, S., Rundensteiner, E., Brown, D., Liang, J., Sisson
Jr., R., Sheybani, E. and Sayre, E. Component Selection to Optimize
Distance Function Learning in Complex Scientific Data Sets,
International Conference on Database and Expert Systems Applications,
Springer, 269-282 (2008).