All content in this area was uploaded by Angela Carter-McAuslan on Oct 15, 2020
Application of SOMs and k-means clustering to geophysical mapping – Lessons learned
Angela Carter-McAuslan* and Colin Farquharson, Memorial University of Newfoundland
Summary
Machine learning techniques are of growing interest to the
geosciences. We discuss the use of self-organizing maps and
k-means clustering as techniques for the analysis of potential
field and radiometric data in the production of predictive
maps. By looking at examples of predictive maps produced
for surface geology on the Baie Verte Peninsula,
Newfoundland, Canada as well as basement geology of the
mid-continent rift in Decorah, Iowa, USA we show the
benefits of a combined SOM – k-means clustering method
over using SOMs or k-means clustering as stand-alone
methods.
Introduction
Machine learning is a broad and fascinating area of study
with varied applications within the earth sciences. One such
application is the generation of automated predictive maps
from remote sensing and geophysical data. In this
presentation we will be discussing two types of unsupervised
machine learning techniques: self-organizing maps (SOMs)
and k-means clustering. K-means clustering (MacQueen,
1967) is a well understood and simple machine learning
technique. However, k-means clustering cannot be applied
to incomplete datasets. Self-organizing maps are a type of
unsupervised neural network algorithm used for the analysis,
visualization and interpretation of multi-dimensional
datasets first developed by Kohonen (1982). They have been
applied to a number of geoscience problems including
bedrock mapping (Carneiro et al., 2012), but are less
straightforward than k-means clustering. In this study we look at the
use of SOMs and k-means clustering separately and in
conjunction with one another for the task of producing
predictive geological maps.
We will use examples from two studies: one using geophysical
data to map surface geology on the Baie Verte Peninsula,
Newfoundland, and one mapping buried geology in Decorah,
Iowa. The Baie Verte study allows for a comparison between
the results of the SOM process and the combined SOM –
k-means clustering process when applied to legacy
geophysical data for the purposes of mapping near-surface
geology. Because the legacy data are incomplete, stand-alone
k-means clustering is not applicable there. The Decorah,
Iowa study allows for a comparison between maps produced
using the SOM process, pure k-means clustering and the
combined SOM – k-means clustering process on relatively
modern co-located datasets.
Self-Organizing Maps Theory
In this presentation, we will be showing maps produced
using the CSIRO SOM implementation SiroSOM. SiroSOM
is based on the MATLAB Toolkit (Leväniemi et al., 2017)
developed by Kohonen et al. (1996). The following is a
general mathematical description of the SOM algorithm
based primarily on Kohonen et al. (1996) and Kohonen
(1998).
All applications of SOM algorithms begin with a set of p
data vectors
$$X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_p\},$$
where each $\vec{x}_i$ is a D-dimensional vector
$$\vec{x}_i = \{a_1, a_2, \ldots, a_D\}.$$
Each element, $a_j$, is the value of a different data type. In
this study the $a_j$ are different types of geophysical data.
SOM neural nets consist of a set of N neurons (also called
computational units or model vectors), each of D dimensions,
$$M = \{\vec{m}_1, \vec{m}_2, \ldots, \vec{m}_N\},$$
where N < p. The goal is to “train” the neural net such that
the neural vectors exist in a D-dimensional data space
mimicking the distribution of the observation vectors, X.
Training is a recursive regression in which each step involves
the presentation of a sample observation vector $\vec{x}_t$ to
the neural net to determine the neural vector most similar to
$\vec{x}_t$. This vector becomes known as the best matching
unit (BMU) associated with $\vec{x}_t$ and is denoted
$\vec{m}_{c,t}$. The BMU is determined using the criterion
$$\lVert \vec{m}_{c,t} - \vec{x}_t \rVert \le \lVert \vec{m}_{i,t} - \vec{x}_t \rVert \quad \text{for all } i.$$
The neural net is then updated according to the rule
$$\vec{m}_{i,t+1} = \vec{m}_{i,t} + h_{c(x),i} \left( \vec{x}_t - \vec{m}_{i,t} \right).$$
The update rule combines competitive and co-operative
learning through the neighborhood function $h_{c(x),i}$. The
neighborhood function encodes the magnitude of the change
to $\vec{m}_{i,t}$ based on its proximity in neural net space
to $\vec{m}_{c,t}$. The exact form that the neighborhood
function takes is specific to each SOM implementation.
However, in all implementations the BMU is modified the
most, and the amount of change to the other neurons in the
net depends on their proximity in neural net space to the
BMU. The learning rate (amount of modification to the neural
vectors) and the size of the affected neighborhood generally
decrease with each iteration of training. In this
implementation both the learning rate and neighborhood
function decrease linearly (Kohonen et al., 1996).
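The training loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the SiroSOM implementation: the rectangular grid, the Gaussian neighborhood, the initialisation from random observations and the names `train_som`, `lr0` and `sigma0` are all assumptions made for the example. Only the linear decay of the learning rate and neighborhood size follows the text.

```python
import numpy as np

def train_som(X, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a small rectangular SOM; returns model vectors and grid coordinates."""
    rng = np.random.default_rng(seed)
    p, D = X.shape
    n_nodes = rows * cols
    # Coordinates of each neuron in neural-net (map) space.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Assumption: initialise the model vectors from randomly chosen observations.
    M = X[rng.choice(p, size=n_nodes, replace=n_nodes > p)].astype(float)
    sigma0 = max(rows, cols) / 2.0          # assumed initial neighborhood radius
    for t in range(n_iter):
        frac = 1.0 - t / n_iter             # decreases linearly from 1 towards 0
        lr = lr0 * frac                     # learning rate
        sigma = max(sigma0 * frac, 0.5)     # neighborhood radius, floored at 0.5
        x_t = X[rng.integers(p)]            # present one sample observation
        c = np.argmin(np.linalg.norm(M - x_t, axis=1))    # index of the BMU
        d2 = np.sum((grid - grid[c]) ** 2, axis=1)        # squared map distance to BMU
        h = lr * np.exp(-d2 / (2.0 * sigma ** 2))         # assumed Gaussian neighborhood
        M += h[:, None] * (x_t - M)         # m_i <- m_i + h * (x_t - m_i)
    return M, grid
```

Because `h` peaks at the BMU and falls off with map distance, the BMU moves furthest towards each sample, while its neighbors on the grid move by smaller amounts. The Baie Verte example below uses a much larger net (52x43 neurons) than would be practical to train with this toy loop.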
Once trained, the data are clustered using the neural net by
grouping data points according to their BMU. As such, the
neurons in the neural net become representative of small
clusters of data. If trained correctly, the neural net should
mimic the topology of the original dataset in data space (i.e.
similar data which are close together in data space should
have BMUs that are close together on the neural net). If a
more in-depth explanation of SOM theory is desired, please
refer to Kohonen (1982).
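Grouping data points by BMU amounts to a nearest-neighbor search against the trained model vectors. The sketch below assumes a Euclidean distance and uses a tiny hand-made set of model vectors `M` in place of a trained net; the helper name `assign_bmu` is hypothetical.

```python
import numpy as np

def assign_bmu(X, M):
    """Return, for each row of X, the index of the nearest model vector in M."""
    # Pairwise squared Euclidean distances, shape (p, N).
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Toy stand-in for a trained net with three neurons.
M = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
X = np.array([[0.1, -0.1], [0.9, 1.2], [3.8, 4.1], [1.1, 0.8]])

bmu = assign_bmu(X, M)   # one BMU index per observation
# Each neuron now represents the small cluster of observations mapped to it.
groups = {int(j): np.flatnonzero(bmu == j) for j in np.unique(bmu)}
```

Here the second and fourth observations share a BMU and so fall into the same small cluster.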
K-means Clustering Theory
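K-means clustering (MacQueen, 1967) is a simple, well
understood machine learning technique for the partitioning
of a set of p observations
$$X = \{\vec{x}_1, \vec{x}_2, \ldots, \vec{x}_p\}$$
into k clusters represented by a set of centroids
$$C = \{\vec{c}_1, \vec{c}_2, \ldots, \vec{c}_k\},$$
where $\vec{c}_j = \{b_1, b_2, \ldots, b_D\}$, by minimizing
the objective function
$$J = \sum_{j=1}^{k} \sum_{\vec{x}_i \in S_j} \lVert \vec{x}_i - \vec{c}_j \rVert^2,$$
where $S_j$ is the set of observations assigned to centroid
$\vec{c}_j$, so that each of the p observations contributes
exactly once.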
The type of optimization used to minimize J is specific to the
k-means cluster implementation.
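One common way to minimize J is Lloyd's algorithm, which alternates between assigning each observation to its nearest centroid and moving each centroid to the mean of its members. The sketch below is an illustration only; the source does not say which optimizer was used, and the function name `kmeans` and its `init` and `seed` parameters are assumptions.

```python
import numpy as np

def kmeans(X, k, init=None, n_iter=100, seed=0):
    """Lloyd's algorithm. Returns centroids C, labels and the objective J."""
    rng = np.random.default_rng(seed)
    if init is None:
        # Assumption: start from k distinct, randomly chosen observations.
        C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    else:
        C = np.asarray(init, dtype=float).copy()
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)              # assignment step
        new_C = C.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):                    # an empty cluster keeps its centroid
                new_C[j] = members.mean(axis=0) # update step
        if np.allclose(new_C, C):
            break
        C = new_C
    # Final assignment and objective: sum of squared distances to own centroid.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    J = d2[np.arange(len(X)), labels].sum()
    return C, labels, J
```

Lloyd's algorithm only finds a local minimum of J, so the result depends on the starting centroids; running it from several starts and keeping the lowest J is a common safeguard. Note also that every distance uses all D components, which is why the method cannot be applied directly to observations with missing values.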
Example 1: Baie Verte, Newfoundland, Canada
The Baie Verte Peninsula is a region of complex geology.
Siliciclastic schist and felsic plutonic rocks associated
with the coastline of ancestral Laurentia are separated by
the Baie Verte Line, a suture associated with the Taconic
orogeny, from seafloor and ocean island arc rocks in the east.
The area hosts base metal and gold deposits, both
historically mined and currently in production.
A suite of legacy geophysical data (Figure 1) is available
from the Geological Survey of Canada for the peninsula. In
this study we used gravity, reduced-to-pole magnetic, and
radiometric data compiled from the 1987 Springdale Survey,
the 1988 Baie Verte Peninsula Survey, and the 2007 Baie
Verte Survey.
Figure 1: Datasets used for the Baie Verte SOM.
The SOM predictive maps were created using a neural net of
52x43 neurons, and the k-means clustering was carried out
using 14 centroids. Figures 2b and 2c show the results for
the stand-alone SOM process and the combined SOM – k-means
clustering process respectively.
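The text does not spell out how the two methods are combined. One common two-stage arrangement, assumed here, is to run k-means on the trained SOM model vectors and then give each observation the cluster of its BMU. The sketch below uses a toy six-node set of model vectors in place of the 52x43 net, and the helper names `nearest`, `kmeans_nodes` and `two_stage_labels` are hypothetical.

```python
import numpy as np

def nearest(A, B):
    """Index of the nearest row of B for every row of A (Euclidean)."""
    return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

def kmeans_nodes(M, init, n_iter=50):
    """Lloyd iterations on the SOM model vectors M, starting from centroids `init`."""
    C = np.asarray(init, dtype=float).copy()
    for _ in range(n_iter):
        labels = nearest(M, C)                      # assign nodes to centroids
        for j in range(len(C)):
            if np.any(labels == j):
                C[j] = M[labels == j].mean(axis=0)  # move centroid to node mean
    return nearest(M, C)                            # cluster label of each node

def two_stage_labels(X, M, node_labels):
    """Class of each observation = k-means cluster of its BMU."""
    return node_labels[nearest(X, M)]

# Toy stand-in for a trained net: six nodes forming two groups.
M = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [5.0, 5.0], [5.5, 5.0], [5.0, 5.5]])
node_labels = kmeans_nodes(M, init=M[[0, 3]])

X = np.array([[0.2, 0.1], [5.2, 5.3], [0.4, 0.4]])
classes = two_stage_labels(X, M, node_labels)
```

Under this arrangement k-means only ever sees the complete model vectors, so it can be used even when the observations themselves contain gaps, provided the SOM stage tolerates them.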
Figure 2: (a) Generalized geology of the Baie Verte Peninsula
(after Coleman-Sadd, 1996) and the predictive maps produced
using (b) only the SOM process and (c) a combination of the
SOM and k-means clustering.
The stand-alone SOM result (Figure 2b) successfully
differentiates many of the lithological units, particularly
the Dunamagon Granite (labelled C in Figure 2a), and
distinguishes between the schist and felsic intrusive rocks
of the Humber zone. With the addition of secondary k-means
clustering (Figure 2c) additional units are delineated. For
example, the Lushs Bight and Springdale groups (labelled E
and D respectively on Figure 2a) are clearly differentiated,
with the extent of the groups well replicated by the k-means
clusters.
Example 2: Decorah, Iowa, USA
The mid-continent rift is a failed Precambrian rift system
(Stein et al., 2014). A suite of high-resolution airborne
magnetic, gravity and gravity gradiometry data was
collected by the USGS (Figure 3) over the Decorah area in
northeast Iowa and southeast Minnesota, where the basement
rocks, associated with the mid-continent rift, are buried
beneath up to 700 m of Paleozoic limestones of the Michigan
basin. Drenth et al. (2015) produced a traditional geological
interpretation from the geophysical data (Figure 4a).
Figure 3: Data used in the Decorah SOM.
The SOM and k-means clustering procedures were applied
to the high-resolution USGS magnetic, gravity, and gravity
gradiometry datasets. The SOM predictive maps were
created using a neural net of 62 x 59 neurons arranged on a
hexagonal lattice laid out on the surface of a toroid. The
k-means clustering was carried out using 7 cluster centroids.
Figures 4b and 4c show the results of the stand-alone SOM
process and the combined SOM – k-means clustering
process respectively.
The stand-alone SOM process (Figure 4b) and the SOM
process with secondary clustering (Figure 4c) produce fairly
comparable results. Both locate and delineate many of the
geological units including the Decorah complex (labelled A)
and the mafic intrusions (labelled C, D, F, and G). However,
the results with secondary clustering delineate the silicic
pluton (labelled E) better and differentiate the Decorah
complex from the mafic intrusions.
Figure 4: (a) Geology of the Decorah, Iowa, area (after Drenth
et al., 2015) and the predictive maps produced using (b) only
the SOM process and (c) a combination of the SOM and k-means
clustering.
Conclusions
With well-chosen parameters, both SOMs and k-means
clustering are useful machine learning tools for the
interpretation of geophysical data, and careful selection of
the input data is equally important to producing good
predictive maps. In both examples presented, the stand-alone
SOM was able to reproduce the geology to some degree, but
the addition of the k-means clustering resulted in a clear
improvement.
Acknowledgments
We would like to acknowledge CSIRO for their provision of
SiroSOM at reduced rates as well as for their technical
support.