Exploratory Hot Spot Profile Analysis Using Interactive Visual Drill-Down Self-Organizing Maps

Real-life datasets often contain small clusters of unusual sub-populations. These clusters, or ‘hot spots’, are usually sparse and of special interest to an analyst. We present a methodology for identifying hot spots and ranking attributes that distinguish them interactively, using visual drill-down Self-Organizing Maps. The methodology is particularly useful for understanding hot spots in high dimensional datasets. Our approach is demonstrated using a large real life taxation dataset.
Exploratory Multilevel Hot Spot Analysis:
Australian Taxation Office Case Study
Denny (1,2), Graham J. Williams (3,1), Peter Christen (1)
(1) Department of Computer Science,
The Australian National University,
Canberra 0200, Australia,
Email: denny@cs.anu.edu.au, peter.christen@anu.edu.au
(2) Faculty of Computer Science,
University of Indonesia
(3) The Australian Taxation Office,
Email: graham.williams@ato.gov.au
Abstract
Population based real-life datasets often contain
smaller clusters of unusual sub-populations. While
these clusters, called ‘hot spots’, are small and sparse,
they are usually of special interest to an analyst.
In this paper we introduce a visual drill-down approach,
based on Self-Organizing Maps (SOMs), to explore the
characteristics of such hot spots in real-life datasets.
Iterative clustering algorithms (such as k-means) and
SOMs are not designed to show these small and sparse
clusters in detail. The feasibility of our approach is
demonstrated using a large real life dataset from the
Australian Taxation Office.
Keywords: self-organizing maps, cluster analysis,
neural network, imbalanced data, drill-down, visual-
ization.
1 Introduction
Cluster analysis is often used to help in understanding
and dealing with the complexities of large datasets.
For example, it may be easier to devise marketing
strategies based on groupings of customers sharing
similar characteristics because the number of group-
ings/clusters can be small enough to make the task
manageable.
The Self-Organizing Map (SOM) (Kohonen 1982) is a
popular tool for cluster analysis for several reasons.
First, SOM performs topological mapping from high-
dimensional data into a two-dimensional map where
similar entities are placed nearby. Second, SOM per-
forms vector quantization which produces a smaller
representative dataset that follows the distribution of
the original dataset. Third, SOM offers various vi-
sualizations which are relatively easy to interpret for
non-technical users when exploring a dataset. Appli-
cations of SOM for cluster analysis can be found in
many domains, such as health (Markey et al. 2003,
Viveros et al. 1996) or marketing (Dolnicar 1997).

Copyright © 2007, Australian Computer Society, Inc. This paper
appeared at the Sixth Australasian Data Mining Conference
(AusDM 2007), Gold Coast, Australia. Conferences in Research
and Practice in Information Technology (CRPIT), Vol. 70. Peter
Christen, Paul Kennedy, Jiuyong Li, Inna Kolyshkina and Graham
Williams, Ed. Reproduction for academic, not-for profit
purposes permitted provided this text is included.

In real life, cluster sizes are normally not equal and
clusters do not have the same interestingness. The
distribution of clusters is often very skewed, as captured by
the Pareto distribution (Pareto 1972), also known as
the “80:20 rule”. Thus, the interesting clusters are
usually only a small fraction of a dataset. Furthermore,
the variance of items at the tail or margin of
the normal distribution of a population is also larger
compared to the center of the normal distribution. In
other words, in real life it is common to find large
dense clusters for common sub-populations and small
sparse clusters for interesting sub-populations. In a
taxation context this could be a group of tax enti-
ties who have a tax debt, while in an insurance con-
text this could be a group of high claiming clients.
Williams (1999) proposed the hot spots methodology
that aims to identify important or interesting groups
in a very large dataset. The methodology uses a com-
bination of clustering and rule induction. As a re-
sult, business organizations can make improvements
on their strategies, such as treatment strategies to im-
prove tax compliance, by understanding these small
and interesting clusters that are called hot spots. It
can be interesting to analyze these hot spots in rela-
tion to the whole population.
However, iterative clustering algorithms (such as
k-means) and SOM tend to merge these small sparse
clusters, thus reducing the ability to analyze them
in detail. The k-means algorithm tries to generate a
relatively uniform distribution on the cluster sizes as
shown by Xiong et al. (2006). As a result, k-means is
unsuitable for highly skewed datasets.
When SOM is used for cluster analysis, it also has
similar issues. Increasing the map size of a SOM only
gives a better resolution map (in terms of lower quan-
tization error and finer cluster borders) but with sig-
nificant additional computational cost. However, an
increased map size does not provide extra information
about these small and sparse clusters. Small sparse
clusters are represented as a few nodes in a SOM,
which reduces the capability to characterize them.
Hierarchical clustering algorithms (Han & Kam-
ber 2006), on the other hand, require high compu-
tational resources, thus making them impractical for
very large datasets. Furthermore, different definitions
of between cluster distances (such as minimum, max-
imum, or average distance) will often produce differ-
ent clustering results. Moreover, the definition of the
between cluster distance has to be determined before-
hand.
Therefore, the approach presented in this paper
aims to help analysts identify and understand hot
spot behaviour. The main contribution of our approach
is drill-down hot spot exploration using SOM-based
visualizations that are capable of handling imbalanced
data.
The rest of the paper is organized as follows. Section 2
briefly introduces SOMs and explains their limitations
for analyzing hot spots. Section 3 reviews current
SOM-based clustering techniques. Our approach
is discussed in Section 4 and Section 5 discusses the
results of our experiments with a real life dataset from
the taxation domain.

Figure 1: Local lattice structure: hexagonal topology
(left) and rectangular topology (right), with
neighbourhood radii 0, 1, and 2 in the map space (adapted
from Vesanto et al. (2000)).

2 Self-Organizing Maps
A SOM is an artificial neural network that performs
unsupervised competitive learning (Kohonen 1982).
Importantly, SOMs allow the visualization and ex-
ploration of a high-dimensional data space by non-
linearly projecting it onto a lower-dimensional man-
ifold, most commonly a 2-D plane (Kohonen 2001).
Artificial neurons are arranged on a low-dimensional
grid. Each neuron $i$ has an $n$-dimensional prototype
vector, $m_i$, also known as a weight or codebook vector,
where $n$ is the dimensionality of the input data.
Each neuron is connected to neighbouring neurons,
determining the topology of the map. In a hexago-
nal grid, each neuron is connected to six neighbours,
while in a rectangular grid each neuron is connected
to four neighbours, as shown in Figure 1. In the map
space, neighbours are equidistant.
SOMs are trained by presenting data vectors to the
map and adjusting the prototype vectors accordingly.
These prototype vectors are initialized to different
values. There are two approaches to training a SOM:
sequential training and batch training. In sequential
training, one data vector is presented to the map at
a time and the prototype vectors are updated. On
the other hand, in batch training, the whole dataset
is presented to the map and all prototype vectors are
updated at once.
In sequential training, the training vectors can be
taken from the dataset in random order, or cyclically.
At each training step $t$, the Best Matching Unit
(BMU) $b_i$ for training data vector $x_i$, i.e. the prototype
vector $m_j$ closest to the training data vector $x_i$,
is selected from the map according to Equation 1:
$$\forall j:\ \|x_i - m_{b_i}(t)\| \le \|x_i - m_j(t)\|, \qquad (1)$$
where only non-missing values are used in the distance
calculation. Then, the prototype vectors of node $b_i$
and its neighbours are moved closer to $x_i$:
$$m_j(t+1) = m_j(t) + \alpha(t)\, h_{b_i j}(t)\, [x_i - m_j(t)], \qquad (2)$$
where $\alpha(t)$ is the learning rate (a tuning parameter)
and $h_{b_i j}(t)$ is the neighbourhood function (often
Gaussian) centered on $b_i$. This process of updating
the prototype vectors is repeated until a predefined
number of iterations or epochs is completed. Both
$\alpha(t)$ and the radius of $h_{b_i j}(t)$ are decreased after each
iteration. Since the time complexity of SOMs is linear
in the number of prototype vectors, the number of data
vectors, and the number of iterations, SOMs are able to
cope with large and high-dimensional datasets.
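
As an illustration of Equations 1 and 2, the following is a minimal NumPy sketch of a single sequential training step. It is not the implementation used in this paper: the function name `sequential_step`, the Gaussian form of the neighbourhood, the use of NaN to mark missing values, and the array layout (one row per map unit, with `grid` holding the map-space coordinates of each unit) are all assumptions made for the example.

```python
import numpy as np

def sequential_step(prototypes, grid, x, alpha, radius):
    """One sequential SOM update for a single data vector x.

    prototypes: (M, n) float array, one prototype vector per map unit.
    grid:       (M, 2) float array, map-space coordinates of each unit.
    x:          (n,) data vector; NaN marks a missing value (assumption).
    """
    observed = ~np.isnan(x)  # only non-missing values enter the distance
    diffs = prototypes[:, observed] - x[observed]
    bmu = int(np.argmin(np.linalg.norm(diffs, axis=1)))  # Equation 1

    # Gaussian neighbourhood centred on the BMU, measured in map space.
    grid_dist_sq = np.sum((grid - grid[bmu]) ** 2, axis=1)
    h = np.exp(-grid_dist_sq / (2.0 * radius ** 2))

    # Equation 2, applied to the observed components only.
    updated = prototypes.copy()
    updated[:, observed] += alpha * h[:, None] * (x[observed] - prototypes[:, observed])
    return bmu, updated
```

In a full training loop, `alpha` and `radius` would be decreased after each iteration, as described above.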
In the batch algorithm, the values of the new prototype
vectors are the weighted averages of the training
data vectors that are mapped to $m_j$ and its neighbours,
where the weight is the neighbourhood kernel
value $h_{b_i j}$ centered on unit $b_i$ (Kohonen 2001).
The new prototype vectors are calculated using Equation 3:
$$m_j(t+1) = \frac{\sum_{i=1}^{N} h_{b_i j}(t)\, x_i}{\sum_{i=1}^{N} h_{b_i j}(t)}, \qquad (3)$$
where $N$ is the number of training data vectors. SOMs
are capable of handling missing values, as Equation 3
only performs summation and counting of the
non-missing values.
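
The batch update of Equation 3, including the treatment of missing values, can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the code used for the experiments: the name `batch_update`, the Gaussian kernel, and the NaN convention for missing values are assumptions, and the dense (N, M, n) intermediate array is only practical for small examples.

```python
import numpy as np

def batch_update(prototypes, grid, data, radius):
    """One batch SOM update (Equation 3); NaN marks a missing value.

    prototypes: (M, n) array, grid: (M, 2) map coordinates, data: (N, n).
    """
    observed = ~np.isnan(data)                       # (N, n) mask
    filled = np.where(observed, data, 0.0)

    # Squared distances that ignore the missing components of each vector.
    diff = (filled[:, None, :] - prototypes[None, :, :]) * observed[:, None, :]
    bmus = np.argmin(np.sum(diff ** 2, axis=2), axis=1)          # (N,)

    # h[i, j]: Gaussian kernel between the BMU of vector i and unit j.
    grid_dist_sq = np.sum((grid[bmus][:, None, :] - grid[None, :, :]) ** 2, axis=2)
    h = np.exp(-grid_dist_sq / (2.0 * radius ** 2))              # (N, M)

    # Weighted sums and weighted counts over the non-missing values only.
    numer = h.T @ filled                                         # (M, n)
    denom = h.T @ observed.astype(float)                         # (M, n)
    safe = np.where(denom > 0, denom, 1.0)
    # Keep the old value where no data vector has that component observed.
    return np.where(denom > 0, numer / safe, prototypes)
```

With a very small radius the kernel weight is close to one for the BMU and close to zero elsewhere, so the update approaches the per-cluster mean.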
The batch algorithm is similar to k-means. The
difference is that the batch algorithm uses weights
in calculating the new ‘centroids’ that are based on
the chosen neighbourhood kernel function, while k-
means assigns the same weight (weight of one for data
vectors assigned to a cluster, weight of zero for the
rest) when calculating the centroids.
The map is usually trained in two phases: a rough
training phase and a fine tuning phase. The rough
training phase usually has a shorter training length and
a wider initial radius than the fine tuning phase. In
the rough phase, the learning rate $\alpha(t)$ and the radius
of $h_{b_i j}(t)$ decrease at a faster rate than in the fine
tuning phase.
After a SOM is trained using a real life dataset, the
common population is usually located in the center
of the map and the remainder at the border, because
of the topological ordering property and the
neighbourhood kernel function used in the training. In
real life datasets, the remainder of a population usu-
ally has a few different characteristics compared to
the common population. For example, in a taxation
context, entities who rely mainly on salary and wages
for income are mapped onto the center of the map
since they are the common population. Other enti-
ties might have a few variations, such as having salary
and wages and interest income; or having salary and
wages, interest, and dividend income.
The hot spots, or ‘uncommon but interesting clusters’,
that we are interested in are
usually located at the border of the map. However,
SOMs suffer from the border
effect (Kohonen 2001): the neighbourhood definition
is not symmetric at the borders of the map. As
shown in Figure 1, the number of neighbours per unit
on the border and corner of the map is not equal to
the number of neighbours in the middle of the map.
Therefore, the density estimation for the border units
is different to the units in the middle of the map (Ko-
honen 2001). As a result, the tails of the marginal
distributions of variables (normally located at border
units) are less well represented than their centers. As
we are interested in hot spots, and these hot spots are
usually located at the borders of the map, there is a
need to address this problem.
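
The asymmetry behind the border effect can be made concrete by counting direct neighbours per unit. The sketch below assumes a rectangular grid with four-connectivity, as in the right-hand lattice of Figure 1; the function name `neighbour_counts` is hypothetical.

```python
import numpy as np

def neighbour_counts(rows, cols):
    """Number of direct neighbours of each unit on a rectangular grid."""
    counts = np.zeros((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                if 0 <= r + dr < rows and 0 <= c + dc < cols:
                    counts[r, c] += 1
    return counts

counts = neighbour_counts(3, 4)
# Corner units have 2 neighbours, other border units 3, interior units 4.
```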
Besides the single level SOM proposed originally
by Kohonen (1982), there are SOMs with hierarchical
structure, such as Hierarchical SOM (Koikkalainen &
Oja 1990) and Growing-Hierarchical SOM (Ditten-
bach et al. 2000). In these approaches, only one node
can be drilled down to the next level. The problem
of drilling down only one node at a time is that the
Voronoi border of the prototype vector in a sparse
area might not be a good cut of the entities in a
hot spot area. Furthermore, the goal of Hierarchi-
cal SOM is to achieve lower computational cost by
using a Tree-Structured SOM to find a BMU faster.
In our approach, several nodes can be selected
interactively by the user for drill-down.

Figure 2: The distance matrix visualization of the
whole population dataset, where distance is the median
of the distances from a node to its neighbours.

3 SOM-based clustering
As mentioned earlier, SOMs perform vector quan-
tization and projection to a 2-D map, and have a
topology-retaining property. This makes SOMs suit-
able for clustering data based on their spatial rela-
tionships on the map using visualizations. Existing
SOM-based clustering methods can be categorized
into visualization based clustering, direct clustering,
and two-level clustering (hybrid) as discussed below.
A rough cluster structure can be observed using
a distance-matrix based visualization. A distance-matrix
based visualization, such as the u-matrix
(Iivarinen et al. 1994), shows distances between
neighbouring nodes using a colour scale
representation on a map grid, as shown in Figure 2 (see footnote 1).
As shown in the colour bar, white indicates a short
distance between a node and its neighbouring nodes,
while black indicates a long distance between the node
and its neighbours. The distance matrix visualiza-
tion methods can be used to show borders between
clusters. Long distances that show highly dissimilar
features between neighbouring nodes divide clusters,
i.e. the dense parts of the maps with similar features
(white regions) (Iivarinen et al. 1994). In other words,
the distances of the neighbouring units in the data
space are represented using shades of colour in the
map space.
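
A distance matrix of the kind shown in Figure 2, where each node is assigned the median distance to its neighbours, could be computed along the following lines. This sketch assumes a rectangular grid with four direct neighbours per interior node and at least two nodes; the lattice actually used for Figure 2 is not stated here, and `median_distance_matrix` is a hypothetical name.

```python
import numpy as np

def median_distance_matrix(codebook):
    """Median data-space distance from each node to its grid neighbours.

    codebook: (rows, cols, n) array of prototype vectors.
    Returns a (rows, cols) array suitable for a gray-scale plot.
    """
    rows, cols, _ = codebook.shape
    result = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    dists.append(np.linalg.norm(codebook[r, c] - codebook[rr, cc]))
            result[r, c] = np.median(dists)
    return result
```

Large values mark nodes whose prototype vectors are far from those of their neighbours, i.e. cluster borders or sparse regions.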
By using this visualization, users can see the clus-
ter structure of the dense part of the map, for example
the center of the map (region marked ‘A’) in Figure 2.
However, it is difficult to see the cluster structure of
the sparse parts at the upper-left and the lower-right
corners of the map (regions marked ‘B’ and ‘C’, respectively).

Footnote 1: All the SOM figures were originally in colour. For printing
purposes, they were converted into gray scale and therefore some
details are lost. In the original version, for example, low values are
represented as shades of blue and high values are represented as
shades of red.

Another method to analyze a hierarchical cluster
structure uses a variant of the data hit
histogram, which shows how many data vectors are
mapped to each node. This variant is called the “Smoothed Data
Histogram” (SDH) and was proposed by Pampalk et al.
(2002). In this visualization technique, each data vector
is mapped to its $s$ closest units (BMUs) with a linearly
decreasing membership degree. The first BMU
has a membership degree of $s/c_s$, the second BMU
has $(s-1)/c_s$, and so forth for the $s$ closest units.
The remaining units have a zero degree of membership.
Pampalk et al. (2002) define $c_s = \sum_{i=0}^{s-1} (s - i)$ to ensure
that the total membership of each data item adds up
to 1. They argue that a hierarchical cluster structure
in the data can be observed by changing the value
of $s$. The drawback of this visualization technique
is that it is sensitive to the parameter $s$. The authors did not
give any heuristics to choose a suitable value of $s$.
They argued that the optimal value of the smoothing
parameter depends on the application. Furthermore,
large values of $s$ will give more value to the units at
the center of the map due to the topological ordering
property of a SOM.
This technique might be able to visualize cluster
structure of the dense parts of the map. However, this
approach cannot show the hierarchical structure of a
sparse part (hot spot) of a map due to the limitation
of SOM as described in Section 2.
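
The SDH membership scheme described above can be sketched in a few lines. This is an illustrative reading of the description in the text, not the reference implementation of Pampalk et al. (2002); the function name is hypothetical, and the sketch assumes complete data and that s does not exceed the number of units.

```python
import numpy as np

def smoothed_data_histogram(prototypes, data, s):
    """Accumulate linearly decreasing memberships over the s closest units."""
    c_s = s * (s + 1) / 2.0                 # equals sum_{i=0}^{s-1} (s - i)
    weights = (s - np.arange(s)) / c_s      # s/c_s, (s-1)/c_s, ..., 1/c_s
    hist = np.zeros(len(prototypes))
    for x in data:
        # Indices of the s closest units, nearest first.
        closest = np.argsort(np.linalg.norm(prototypes - x, axis=1))[:s]
        hist[closest] += weights
    return hist
```

With s = 1 this reduces to the ordinary data hit histogram, since each data vector then contributes a weight of one to its BMU only.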
In direct clustering, each map unit is treated as a
cluster, its members being the data vectors for which
it is the BMU. This approach has been applied to
a breast cancer database (Markey et al. 2003), to the
health insurance industry (Viveros et al. 1996), and
to market segmentation (Dolnicar 1997).
A disadvantage is that the map resolution must
match the desired number of clusters, which must
be determined in advance. Furthermore, taking each
map unit as a cluster centroid does not guarantee that
the clustering result will minimize within-cluster dis-
tances and maximize between-cluster distances since
SOMs will produce more units for large clusters.
Again, this technique cannot show the cluster struc-
ture of the sparse part of a map due to the limitation
of SOM.
In contrast to direct clustering, in two-level clus-
tering, the units of a trained SOM are treated
as ‘proto-clusters’ serving as an abstraction of the
dataset (Vesanto & Alhoniemi 2000). Their proto-
type vectors are clustered using a traditional cluster-
ing technique, such as k-means or agglomerative hi-
erarchical clustering, to form the final clusters. Each
data vector belongs to the same cluster as its BMU.
When a SOM is used in the first level of the pro-
cedure, it leads to two advantages. Firstly, the orig-
inal data vectors are characterized by a considerably
smaller-sized set of prototype vectors, allowing effi-
cient use of clustering algorithms to divide the proto-
types into groups, as shown by Vesanto & Alhoniemi
(2000). As a result, this approach is suitable for large
or high-dimensional datasets, such as genome data,
and for obtaining an initial understanding of possible
clusters. For example, after the optimal number of
clusters is decided, based on data exploration of the
clustering of the maps, clustering with that number
of clusters can be performed directly on the data vec-
tors instead of on the prototype vectors, if desired.
Furthermore, it allows a visual presentation and in-
terpretation of the clusters via the 2-D grid.
The two-level clustering method also has the same
drawback as the previously mentioned methods, as it
also uses SOM as the abstraction layer. It is not pos-
sible to see the cluster structure of the sparse part of
the map, even when using an agglomerative hierar-
chical clustering on top of the map.
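
The two-level procedure can be illustrated with a small sketch: the prototype vectors are clustered first, and each data vector then inherits the cluster of its BMU. The plain k-means routine below stands in for whichever clustering technique is chosen at the second level; the function names, the random initialization, and the fixed iteration count are assumptions for the example.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        for j in range(k):
            if np.any(labels == j):          # leave empty clusters unchanged
                centroids[j] = points[labels == j].mean(axis=0)
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)

def two_level_clusters(prototypes, data, k):
    """Cluster the prototypes, then give each data vector its BMU's cluster."""
    unit_labels = kmeans(prototypes, k)
    dists = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=2)
    bmus = np.argmin(dists, axis=1)
    return unit_labels[bmus]
```

Because the second level only sees the prototype vectors, a hot spot that is represented by a few nodes contributes only those few points to the clustering, which is the limitation discussed above.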
In detecting changes in cluster structure using
SOM, Denny & Squire (2005) used two level clus-
tering as described previously and multiple visualiza-
tion linking to show how clusters change over time,
such as emerging clusters, missing clusters, enlarging
clusters, and shrinking clusters. Their methods were
tested using synthetic and real-life datasets using the
World Development Indicator data published by the
World Bank (World Bank 2003). The results verify
that the methods are capable of revealing changes in
cluster structure, corresponding to known changes in
economic fortunes of countries.
4 Our Visual Drill-Down Approach
Our visual SOM drill-down approach is applied to
a real-life dataset from the Australian Taxation Of-
fice (ATO). In this section, we discuss data pre-
processing, map training, identifying hot spots, and
drilling-down the hot spots.
4.1 Dataset
Due to data confidentiality, the complete data de-
scription and results cannot be shown in this paper.
However, we do provide aggregate indicative results
that demonstrate the effectiveness of our approach.
The motivation of the analysis is to understand
the logic and structures that drive tax payers’ com-
pliance behaviour (behavioural archetypes). The idea
is to construct ‘psychographic groups’ (Wells 1975) by
using data mining. Understanding the difference be-
tween low and high risk tax payers will be valuable
for the ATO.
The archetype dataset consists of about 6.5 mil-
lion entities with 89 numerical attributes which reflect
tax payers’ behaviour. In general, these attributes can
be categorized into: income profile (amount and pro-
portion of each income source), propensity to lodge
correctly and on time (lodgement profile), propen-
sity/capacity to pay (debt profile), market segments,
demographics, Socio-Economic Indicators for Areas
(SEIFA) (Trewin 2003), and involvement in Aggres-
sive Tax Planning. These attributes were manually
selected by the ATO’s analysts.
4.2 Data Preprocessing
In distance-based clustering methods, it is important
to perform normalization prior to clustering since
attributes might have different scale/range (Han &
Kamber 2006). Without normalization, attributes
with larger ranges will have more influence on the
distance measurement. Common normalization techniques
are z-score normalization, min-max normalization,
and decimal scaling.
In the dataset, we found that some attributes have
a large range relative to their standard deviation. When all of the
attributes in the dataset are normalized using z-score,
the normalized values of these attributes will still have
larger ranges.
The normalized value $v'$ of a value $v$ of attribute $A$ is calculated
by $v' = (v - \bar{A}) / \sigma_A$. The range of the z-score
normalized values ($\mathrm{range}_{A'}$) is therefore the range in the
original dataset ($\mathrm{range}_A$) divided by the standard deviation
of the original dataset ($\sigma_A$), as shown below:
$$
\begin{aligned}
\mathrm{range}_{A'} &= \mathrm{max}_{A'} - \mathrm{min}_{A'} \\
&= \frac{\mathrm{max}_A - \bar{A}}{\sigma_A} - \frac{\mathrm{min}_A - \bar{A}}{\sigma_A} \\
&= \frac{\mathrm{max}_A - \mathrm{min}_A}{\sigma_A} = \frac{\mathrm{range}_A}{\sigma_A}
\end{aligned}
$$
where $\bar{A}$ is the mean, $\mathrm{min}_A$ and $\mathrm{max}_A$ are the minimum
and maximum of the original attribute
values, and $\mathrm{min}_{A'}$ and $\mathrm{max}_{A'}$ are the minimum and
maximum of the normalized values. Therefore, when
an attribute has a large range relative to its standard deviation, the
range of the normalized values will be large,
outweighing other attributes in the distance calculation.
It is therefore suggested to use a mixed normalization
method, such as z-score and min-max normalization,
or to use weight coefficients in the distance
calculation.
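
A small numerical check of this relationship is sketched below. The two attributes are synthetic and purely illustrative (they are not taken from the ATO dataset), and the population standard deviation (NumPy's default) is assumed.

```python
import numpy as np

def zscore(values):
    """z-score normalization using the population standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def value_range(values):
    values = np.asarray(values, dtype=float)
    return values.max() - values.min()

# Two synthetic attributes with the same original range (100).
skewed = np.array([0.0] * 99 + [100.0])   # one extreme value, sigma is about 9.95
balanced = np.array([0.0, 100.0] * 50)    # evenly split, sigma is 50

# After z-score normalization the range equals range / sigma, so the skewed
# attribute spans about 10.05 units while the balanced one spans 2.
skewed_range = value_range(zscore(skewed))
balanced_range = value_range(zscore(balanced))
```

The skewed attribute would therefore contribute roughly five times more to a Euclidean distance at its extremes than the balanced one, even though both were z-score normalized.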
As SOMs can only handle numerical attributes, all
non-numerical attributes have to be transformed into
numerical attributes. Categorical attributes, such as
market segmentation and lodgement channel, are con-
verted into numerical attributes by encoding each cat-
egorical value into a binary attribute. Furthermore,
some numerical attributes that can have negative and
positive values are split into two new variables that
only contain the positive values or only the negative
values to make it easier to interpret the result.
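
The two transformations described above can be sketched as follows. The category labels used in the usage example are hypothetical, and filling the non-applicable entries of the split variables with zero is an assumption; the text does not state how those entries are represented.

```python
import numpy as np

def one_hot(categories):
    """Encode each categorical value as its own binary attribute."""
    levels = sorted(set(categories))
    matrix = np.array([[1.0 if value == level else 0.0 for level in levels]
                       for value in categories])
    return levels, matrix

def split_by_sign(values):
    """Split a signed attribute into a positive-only and a negative-only one."""
    values = np.asarray(values, dtype=float)
    positive = np.where(values > 0, values, 0.0)   # zero elsewhere (assumption)
    negative = np.where(values < 0, values, 0.0)   # zero elsewhere (assumption)
    return positive, negative

# Hypothetical lodgement channel labels, for illustration only.
levels, encoded = one_hot(["paper", "electronic", "paper"])
gains, losses = split_by_sign([-5.0, 0.0, 3.0])
```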
4.3 Map Training
The map is initialized using linear initialization (Ko-
honen 2001), and trained in two phases using batch
training. In linear initialization, the prototype vec-
tors are initialized based on the two largest principal
components. Linear initialization is chosen over ran-
dom initialization because it speeds up the learning
process by an order of magnitude by having shorter
training lengths (Kaski & Kohonen 1998). Further-
more, linear initialization combined with batch train-
ing will produce the same map if the learning process
were redone. Random initialization might produce
different orientations of the map.
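
Linear initialization can be sketched as laying the prototype vectors out on the plane spanned by the two largest principal components. The sketch below assumes complete data and a rectangular grid, and the choice to span one standard deviation on either side of the mean along each component is an assumption made for the example, not a detail taken from this paper.

```python
import numpy as np

def linear_init(data, rows, cols):
    """Initialize a (rows, cols, n) codebook on the first two principal axes."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, singular, vt = np.linalg.svd(data - mean, full_matrices=False)
    std = singular[:2] / np.sqrt(len(data) - 1)
    u = np.linspace(-1.0, 1.0, rows)   # position along the first component
    v = np.linspace(-1.0, 1.0, cols)   # position along the second component
    return (mean
            + u[:, None, None] * std[0] * vt[0]
            + v[None, :, None] * std[1] * vt[1])
```

Since no random numbers are involved, repeated calls on the same data give the same initial map, which is the reproducibility property noted above.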
Batch training is chosen because it produces more
stable asymptotic values for the prototype vectors
and it does not have the convergence problem of se-
quential training (Kohonen 2001). Furthermore, with
a batch training algorithm, it is possible to utilize
multi-processor environments to speed up the train-
ing process.
The map size, training length, initial and final ra-
dius are chosen by considering the best practice, as
suggested by Vesanto et al. (2000).
4.4 Identifying Hot Spots in Self-Organizing
Maps
Generally, in business, users are more interested in
“abnormal clusters” or hot spots (e.g. clusters of en-
tities who have debts) than “normal clusters”. Hot
spots in SOMs can be identified by two approaches,
by using the distance matrix visualizations as well as
analysts’ feedback based on component plane visual-
izations.
Because entities in hot spots are often located at the
tail of distributions, they are usually less homogeneous
than the common/regular
entities, so these regions can be identified by using the
distance matrix. In distance matrix visualizations,
homogeneous groups (low variation) will have
shorter neighbour distances (the white regions) than
high variation groups (the dark regions), as
shown in Figure 2. Regions that have longer distances
should then be investigated further by using component
plane visualizations.
Component planes show the spread of values of
a certain component of all prototype vectors in a
SOM (Tryba et al. 1989). The value of a component
in a node is the ‘average’ value of entities in the node
and its neighbours according to the neighbourhood
function and the final radius used in the final train-
ing (Equations 2 and 3). The colour coding of the
map is created based on the maximum and the min-
imum values of the component of the map. In this
paper, we use the ‘gray’ colour map where the maxi-
mum value is assigned black and the minimum value
is assigned white. Component planes can be used to
see interesting cluster patterns and correlations between
variables (Himberg 1998, Vesanto 1999).
In Figure 2, there are two hot spots according to
the aforementioned criteria, one in the top-left corner
(region marked ‘B’) and another one in the bottom-right
corner (region marked ‘C’). According to the component
planes, such as the component plane of the
number of debt cases shown in Figure 3, and domain
expertise, the hot spot in the bottom-right corner
is more interesting than the one in the top-left
corner. The bottom-right corner region consists of
entities who have debt, have high taxable income, are
involved in aggressive tax planning schemes, and have
high risk scores. The top-left corner, on the other
hand, consists of entities who received allowances and
have more amendments.

Figure 3: Component plane of ‘number of debt cases’
(Cnt_IT_Debt_Cases) of the whole population; colour bar
from 0.0000 to 1.4912.

Figure 4: Component plane of ‘number of debt cases
paid’ of the whole population; colour bar from 0.0000
to 1.3032.

The entities in the bottom-right region have highly
dissimilar characteristics. However, at this level, it is
difficult to differentiate the debt behaviour as shown
in Figures 3 and 4. Therefore, it is a good idea to drill
down into this region as discussed in the next section.
In identifying hot spots, the domain knowledge
of analysts is invaluable because some attributes are
more interesting compared to others. In this case, for
example: involvement in tax schemes, lodgement be-
haviours, number of debt cases, and taxable income,
are more interesting in identifying hot spots compared
to market segmentation.
4.5 Drill Down and Visualizing Hot Spots
After analysts choose an interesting part of the top level map
to explore (that is, they designate this group as a hot spot),
a sub-map of the region is trained
using the entities that are mapped to the chosen region.
Some issues that need to be taken care of in train-
ing the sub-map are: consistency of interpretation of
the visualization of the sub-map, and maintaining the
sub-map quality with respect to the sub-population.
In order to make interpretation of the visualization
of the sub-map consistent for the analysts, the orientation
of the map should be preserved and the colour
coding should be consistent. The drawback of using
linear initialization for the sub-map based on the entities
in the sub-map is that the orientation of the
sub-map might be different to the orientation of the
top level map. For example, the debt entities were located
at the bottom-right corner of the top level map
but they might be located at the top-left corner as we
drill down. This might confuse the user. This could
happen when the two largest principal components
of the whole population and the sub-population are
different.

Figure 5: Component plane of SEIFA (Seifa1_05) of the sub-map
of region marked ‘C’ in Figure 2; colour bar from 885.97
to 1.173K.

Therefore, it is suggested that the top level map is
used as the initial map of the sub-map. The radius of
the rough phase training should be wide enough, oth-
erwise parts of the map might be empty (no entities
mapped to particular nodes). Therefore, as a guide,
the initial radius of the rough phase can be half of
the longest side of the map and the initial radius of the fine tuning
phase can be a quarter of the longest side.
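
The data selection and radius guideline for the drill-down step can be sketched as follows. The helper names are hypothetical, complete data is assumed, and the sketch covers only the preparation of the sub-map training: the selected entities would then be used to train a sub-map that starts from the top level prototype vectors, for example with a batch algorithm.

```python
import numpy as np

def select_sub_population(data, prototypes, selected_units):
    """Return the data vectors whose BMU is one of the selected map units."""
    dists = np.linalg.norm(data[:, None, :] - prototypes[None, :, :], axis=2)
    bmus = np.argmin(dists, axis=1)
    return data[np.isin(bmus, list(selected_units))]

def initial_radii(rows, cols):
    """Guideline radii: half (rough) and a quarter (fine) of the longest side."""
    longest = max(rows, cols)
    return longest / 2.0, longest / 4.0
```

For example, selecting the units of a hot spot region and calling `select_sub_population` yields the sub-population, and `initial_radii(20, 30)` returns 15.0 and 7.5 as the initial radii of the two training phases.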
The sub-map can be visualized using distance matrix
visualization and component plane visualization.
To show the distribution of values in the sub-map
relative to the whole population, we suggest that
the component planes of the sub-map are drawn with
the colour map used for the whole population, as
described in Section 4.4. In other words, black in
the sub-map visualizations represents the maximum
value of the component in the top level map, not the
maximum value of the component in the sub-map. For
example, Figure 5 shows the distribution of the
Socio-Economic Indexes for Areas (SEIFA) for the
bottom-right corner of the whole map. As the sub-map
has better quality in terms of quantization error
(the entities mapped to a node are more homogeneous),
a component value in the sub-map might exceed the
maximum value of the whole map. Values above the
maximum of the whole map are also drawn in black.
Therefore, when a cluster of black nodes appears in
the visualization, their values may actually exceed
the value shown for black in the colour bar.
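The shared colour scale can be illustrated with a small sketch. It assumes a linear scale in which the lightest shade is the top level minimum and black is the top level maximum; the function name `shade` and the numeric range are hypothetical.

```python
def shade(value, top_min, top_max):
    """Return (darkness, exceeds): darkness 0.0 is lightest, 1.0 is black."""
    if top_max <= top_min:
        raise ValueError("top level range must be non-empty")
    t = (value - top_min) / (top_max - top_min)
    darkness = min(1.0, max(0.0, t))   # clip to the top level colour bar
    exceeds = value > top_max          # black may hide a larger value
    return darkness, exceeds

# Hypothetical component range of the top level map.
TOP_MIN, TOP_MAX = 0.0, 2.0
sub_map_values = [0.5, 2.0, 2.6]
shades = [shade(v, TOP_MIN, TOP_MAX) for v in sub_map_values]
```

Returning the `exceeds` flag alongside the shade is one way a tool could warn the analyst that a black node lies beyond the range of the colour bar.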
The training of the sub-map will be considerably
faster than the training of the whole population, as
the number of data vectors mapped to the region is
considerably smaller. Therefore, it is possible for
users to explore hot spots interactively.
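The speed-up comes from training only on the entities mapped to the selected region. A minimal sketch of that selection step follows, assuming squared Euclidean distance for the best matching unit and a flat list of prototypes; the function names are hypothetical and not taken from the Java SOM Toolbox.

```python
def bmu_index(vector, codebook):
    """Index of the prototype closest to vector (squared Euclidean)."""
    best, best_dist = 0, float("inf")
    for i, prototype in enumerate(codebook):
        dist = sum((v - p) ** 2 for v, p in zip(vector, prototype))
        if dist < best_dist:
            best, best_dist = i, dist
    return best

def select_region(data, codebook, region_nodes):
    """Keep only the data vectors whose best matching unit is in the region."""
    region = set(region_nodes)
    return [x for x in data if bmu_index(x, codebook) in region]

# Hypothetical top level map with three nodes and four entities.
codebook = [[0.0, 0.0], [10.0, 10.0], [20.0, 20.0]]
data = [[1.0, 1.0], [9.0, 9.0], [19.0, 21.0], [0.0, 2.0]]
subset = select_region(data, codebook, region_nodes={1, 2})
```

Only `subset` would then be passed to the sub-map training, which is why a drill-down is much cheaper than retraining on the whole population.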
5 Results and Discussion
To interpret multiple visualizations, analysts need to
understand that these visualizations are linked by
position or by colour. Visualizations of the same map
are linked by position, which means that each entity
stays at the same position in every visualization.
For example, Figures 2, 3, and 4 are linked by
position. Visualizations of the whole map and of the
sub-map are linked by colour, as described previously:
the colour map of the top level map is also used as
the colour map of the sub-map.
[Figure 6 panel. Component plane: Employees_Mkt05; colour bar from 0.1029 to 0.9995.]
Figure 6: Component plane of ‘employee market’ of
the whole population. Value of 1.0 means that the
node consists of 100% employees.
[Figure 7 panel. Component plane: SW05_pc; colour bar from -40.48 to 120.29.]
Figure 7: Component plane of percentage of salary
and wages to total income of the whole population.
In our experiments, the map size is 15x30 nodes, with
a hexagonal lattice structure. The initial radii of
the rough phase and the fine tune phase are 8 and 4,
respectively. The training lengths of the rough phase
and the fine tune phase are 6 and 10 epochs,
respectively. The training process took about 5 hours
on a Debian GNU/Linux machine with two 64-bit AMD
dual-core 3 GHz processors and 16 GB of memory, using
our Java SOM Toolbox (see footnote 2).
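The text reports the initial radii and training lengths but not how the radius shrinks during each phase. The sketch below assumes a linear decrease, which is a common choice in SOM implementations; the final radii (4 for the rough phase, 1 for the fine tune phase) and the function name are assumptions made for illustration.

```python
def radius_schedule(start, end, epochs):
    """Neighbourhood radius per epoch, shrinking linearly from start to end."""
    if epochs < 1:
        raise ValueError("need at least one epoch")
    if epochs == 1:
        return [float(start)]
    step = (start - end) / (epochs - 1)
    return [start - k * step for k in range(epochs)]

# Reported in the text: initial radii 8 and 4, with 6 and 10 epochs.
# Assumed here: the rough phase ends at radius 4, the fine phase at 1.
rough = radius_schedule(8, 4, epochs=6)
fine = radius_schedule(4, 1, epochs=10)
```

Ending the rough phase where the fine tune phase begins avoids a jump in neighbourhood size between the two phases.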
As discussed in Section 2, the common population in
a real life dataset is usually located in the centre
of the map. The entities in the centre of the map of
the whole population are relatively homogeneous, as
shown in Figure 2. Based on the component plane
visualizations, this common population mainly consists
of employees (Figure 6) with salary and wages as the
main source of income (Figure 7).
At this level, we can see that e-tax (see footnote 3)
is an income tax return lodgement channel that is
commonly used by employees, as shown in Figure 8.
This is to be expected, as their tax returns are
simple. However, the usage of the e-tax channel can be
improved further since, as a group, only 40% of the
entities mapped to the darkest nodes of the map were
using e-tax. If necessary, the ATO might want to
promote e-tax directly to the remaining 60% of the
group, as they have similar behaviour/characteristics.
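For a 0/1 attribute such as e-tax usage, the component value of a node can be read as the share of its entities that have the attribute (compare the caption of Figure 6). The sketch below illustrates this reading on a hypothetical group of ten entities; it ignores the smoothing effect of the SOM neighbourhood, and all names and values are invented for the example.

```python
def etax_share(uses_etax):
    """Fraction of a group using e-tax, from 0/1 indicator values."""
    if not uses_etax:
        raise ValueError("group is empty")
    return sum(uses_etax) / len(uses_etax)

# Hypothetical group of ten entities mapped to the darkest nodes.
group = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0,
         "F": 1, "G": 0, "H": 1, "I": 0, "J": 0}
share = etax_share(list(group.values()))
# Entities in the same group that do not yet use e-tax.
targets = sorted(name for name, flag in group.items() if flag == 0)
```

Here `share` corresponds to the 40% quoted in the text, and `targets` is the remaining 60% that a promotion could address.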
At the whole population level, it is not possible
to differentiate debt behaviours because these
entities are mapped to a small number of units at the
lower-right corner of the map, as shown in Figures 3
and 4. Debt behaviour can be differentiated by
observing debt-related attributes of this
sub-population, such as total payment arrangements
made, total default payment arrangements, total
finalized payment arrangements, and age of debt.
Footnote 2: Contact the author if you are interested in using the Java SOM Toolbox.
Footnote 3: http://www.ato.gov.au/etax
[Figure 8 panel. Component plane: ChannelETAX; colour bar from 0.0011 to 0.3974.]
Figure 8: Component plane of ‘usage of e-tax lodgement
channel’ of the whole population.
[Figure 9 panel. Distance matrix; colour bar from 0.7849 to 14.761.]
Figure 9: Distance matrix visualization of the
sub-map of region marked ‘C’ in Figure 2.
In order to see the debt behaviour in detail, we
drill down into the lower-right corner of the top
level map, as explained in the previous section. At
this level, we can also use the distance matrix
visualization (Figure 9) to highlight the hot spot of
this sub-map. In Figure 9, the hot spot is located at
the bottom of the map.
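One common way to compute a distance matrix value for a node is the average distance between its prototype and those of its lattice neighbours. The sketch below assumes a rectangular lattice with up to four neighbours per node for brevity (the maps in this paper use a hexagonal lattice, where interior nodes have six neighbours); the function name and data layout are hypothetical.

```python
import math

def distance_matrix(codebook, rows, cols):
    """Mean Euclidean distance from each node to its lattice neighbours."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            here = codebook[r * cols + c]
            # Up, down, left and right neighbours that lie inside the map.
            neighbours = [
                codebook[nr * cols + nc]
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            if neighbours:
                total = sum(dist(here, n) for n in neighbours)
                result[r][c] = total / len(neighbours)
    return result

# Hypothetical 1 x 3 map: the third node is far from the other two.
umat = distance_matrix([[0.0], [0.0], [10.0]], rows=1, cols=3)
```

Nodes with large values are far from their neighbours in the data space, so they stand out in the visualization.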
In the sub-map, we are able to identify a group for
which nearly all of the debt cases have been paid
(Figures 10 and 11) but which has a higher latest debt
stage. It is interesting to note that these entities
also live in areas with a slightly above average
Socio-Economic Indexes for Areas (SEIFA) value
(Figure 5), which could mean that they have the
capacity to pay. This kind of analysis is not possible
at the whole population level, as these entities are
squeezed into a few nodes of the whole map, which
makes them difficult to differentiate.
It is also interesting to note that the hot spot of
the sub-map consists of entities that are involved in
aggressive tax planning activities as a promoter or a
participant. Furthermore, this group is characterized
by longer debt age, a higher stage of compliance
enforcement taken by the ATO, and a lower percentage
of cases paid.
[Figure 10 panel. Component plane: Cnt_IT_Debt_Cases; colour bar from 0.0000 to 1.4912.]
Figure 10: Component plane of ‘number of debt cases’
of the sub-map of region marked ‘C’ in Figure 2.
[Figure 11 panel. Component plane: Cnt_Cases_Paid; colour bar from 0.0000 to 1.3032.]
Figure 11: Component plane of ‘number of debt cases
paid’ of the sub-map of region marked ‘C’ in Figure 2.
6 Conclusion and Future Work
We have highlighted the use of SOMs in exploring
hot spots in a large real world dataset from the
taxation domain. Based on our experiments, our
approach is an effective tool for hot spot exploration
since it offers visualizations that are easy for
non-technical users to understand. Moreover, SOMs are
able to handle missing values, are computationally
feasible for large datasets, and are able to exploit
multi-processor environments. Furthermore, users of
our approach do not have to determine the number of
clusters or a between-cluster distance definition
beforehand.
With our approach, users are able to select which
regions to drill down into, whereas in agglomerative
clustering algorithms the between-cluster distance
formula dictates how the population is split.
Therefore, users are able to select regions/clusters
based on their business drivers/needs. This is
particularly useful as some attributes have higher
importance than others.
This work is part of a larger research project in
which we are interested in observing the dynamics of
hot spots over time, such as finding entities that are
moving into or out of hot spots. Such knowledge would
be valuable, as analysts can derive strategies to
encourage or deter movement into or out of the hot
spots, or evaluate the effectiveness of their
implemented strategies.
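As a rough illustration of what finding entities that move into or out of a hot spot could involve, the sketch below compares the membership of a hot spot at two points in time using set operations. It is not a method from this paper; the function name and entity identifiers are hypothetical.

```python
def hot_spot_movement(before, after):
    """Compare hot spot membership at two points in time."""
    before, after = set(before), set(after)
    return {
        "moved_in": after - before,    # in the hot spot only at the later time
        "moved_out": before - after,   # in the hot spot only at the earlier time
        "stayed": before & after,      # in the hot spot at both times
    }

# Hypothetical entity identifiers in a hot spot in two consecutive periods.
changes = hot_spot_movement(before={"e1", "e2", "e3"},
                            after={"e2", "e3", "e4"})
```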
References
Denny & Squire, D. M. (2005), Visualization of clus-
ter changes by comparing Self-Organizing Maps, in
T. B. Ho, D. Cheung & H. Liu, eds, ‘PAKDD’05’,
Vol. 3518 of Lecture Notes in Computer Science,
Springer, pp. 410–419.
Dittenbach, M., Merkl, D. & Rauber, A. (2000),
Growing hierarchical Self-Organizing Map, in ‘Pro-
ceedings of the International Joint Conference on
Neural Networks’, Vol. 6, Technische Universität
Wien, IEEE, Piscataway, NJ, pp. 15–19.
Dolnicar, S. (1997), The use of neural networks in
marketing: market segmentation with self organ-
ising feature maps, in ‘Proceedings of WSOM’97,
Workshop on Self-Organizing Maps, Espoo, Fin-
land, June 4–6’, Helsinki University of Technology,
Neural Networks Research Centre, Espoo, Finland,
pp. 38–43.
Han, J. & Kamber, M. (2006), Data Mining: Con-
cepts and Techniques (second edition), Morgan
Kaufmann, San Francisco, CA.
Himberg, J. (1998), Enhancing the SOM-based
data visualization by linking different data projec-
tions, in ‘Proceedings of 1st International Sym-
posium Intelligent Data Engineering and Learn-
ing (IDEAL’98)—Perspectives on Financial Engi-
neering and Data Mining’, Springer, Hong Kong,
pp. 427–434.
Iivarinen, J., Kohonen, T., Kangas, J. & Kaski,
S. (1994), Visualizing the clusters on the
Self-Organizing Map, in C. Carlsson, T. Järvi &
T. Reponen, eds, ‘Proceedings of the Conference on
Artificial Intelligence Research in Finland’, Vol. 12,
Finnish Artificial Intelligence Society, Helsinki,
Finland, pp. 122–126.
Kaski, S. & Kohonen, T. (1998), Tips for process-
ing and color-coding of Self-Organizing Maps, in
G. Deboeck & T. Kohonen, eds, ‘Visual Explo-
rations in Finance with Self-Organizing Maps’,
Springer, London, pp. 195–202.
Kohonen, T. (1982), ‘Self-organized formation of
topologically correct feature maps’, Biological Cy-
bernetics 43, 59–69.
Kohonen, T. (2001), Self-Organizing Maps (Third
Edition), Vol. 30 of Springer Series in Information
Sciences, Springer, Berlin, Heidelberg.
Koikkalainen, P. & Oja, E. (1990), Self-organizing hi-
erarchical feature maps, in ‘Proceedings IJCNN-
90, International Joint Conference on Neural Net-
works, Washington, DC’, Vol. 2, IEEE Service Cen-
ter, Piscataway, NJ, pp. 279–285.
Markey, M. K., Lo, J. Y., Tourassi, G. D. & Floyd
Jr., C. E. (2003), ‘Self-organizing map for cluster
analysis of a breast cancer database.’, Artificial In-
telligence in Medicine 27(2), 113–127.
Pampalk, E., Rauber, A. & Merkl, D. (2002), Us-
ing smoothed data histograms for cluster visual-
ization in self-organizing maps, in ‘Artificial Neu-
ral Networks - ICANN 2002: International Confer-
ence, Madrid, Spain, August 28-30, 2002. Proceed-
ings’, Vol. 2415/2002, Springer Berlin / Heidelberg,
pp. 871–876.
Pareto, V. (1972), Manual of Political Economy,
Macmillan, London. Translated by Ann S. Schwier.
Edited by Ann S. Schwier and Alfred N. Page.
Trewin, D. (2003), Socio-economic indexes for areas:
Australia 2001, Technical Report 2039, Australian
Bureau of Statistics.
Tryba, V., Metzen, S. & Goser, K. (1989), Designing
basic integrated circuits by self-organizing feature
maps, in ‘Neuro-Nîmes ’89. International Workshop
on Neural Networks and their Applications’,
ARC; SEE, EC2, Nanterre, France, pp. 225–235.
Vesanto, J. (1999), ‘SOM-based data visualization
methods’, Intelligent Data Analysis 3(2), 111–126.
Vesanto, J. & Alhoniemi, E. (2000), ‘Clustering of
the Self-Organizing Map’, IEEE Transactions on
Neural Networks 11(3), 586–600.
Vesanto, J., Himberg, J., Alhoniemi, E. & Parhankan-
gas, J. (2000), SOM toolbox for Matlab 5, Re-
port A57, Helsinki University of Technology, Neural
Networks Research Centre, Espoo, Finland.
Viveros, M. S., Nearhos, J. P. & Rothman, M. J.
(1996), Applying data mining techniques to a
health insurance information system, in T. M. Vi-
jayaraman, A. P. Buchmann, C. Mohan & N. L.
Sarda, eds, ‘Proceedings of 22th International Con-
ference on Very Large Data Bases (VLDB’96),
September 3-6, 1996, Mumbai (Bombay), India’,
Morgan Kaufmann, pp. 286–294.
Wells, W. D. (1975), ‘Psychographics: A critical
review’, Journal of Marketing Research (JMR)
12(2), 196–213.
Williams, G. J. (1999), Evolutionary hot spots data
mining - an architecture for exploring for interest-
ing discoveries, in ‘PAKDD ’99: Proceedings of the
Third Pacific-Asia Conference on Methodologies for
Knowledge Discovery and Data Mining’, Springer-
Verlag, London, UK, pp. 184–193.
World Bank (2003), World Development Indicators
2003, The World Bank, Washington DC.
Xiong, H., Wu, J. & Chen, J. (2006), K-means cluster-
ing versus validation measures: a data distribution
perspective, in ‘KDD ’06: Proceedings of the 12th
ACM SIGKDD international conference on Knowl-
edge discovery and data mining’, ACM Press, New
York, NY, USA, pp. 779–784.