Unsupervised algorithms for the automatic classification of EWS maps: a
comparison
Federico Di Palma, Giuseppe De Nicolao
University of Pavia
Via Ferrata 1, 27100 Pavia, Italy
{federico.dipalmagiuseppe.denicolao}@unipv.it
Oliver M. Donzelli, Guido Miraglia
STMMicroelectronics
Via Olivetti, Agrate Brianza, Italy
{oliver.donzelliguido.miraglia}@st.com
Abstract – Recently, it has been shown that the classification of Electrical Wafer Sorting failure maps can be performed by means of unsupervised methods. In this work four different unsupervised methods are compared: SOM, K-means, Neural Gas, and an Expectation Maximization classifier. The algorithms are compared using a benchmark based on a probabilistic model. The performance of the classification is assessed by means of a new index, called the F-index, based on the knowledge of the true classification. Moreover, the correlation between the proposed index and the following indexes is studied: CH-index, D-index, I-index and average likelihood.
INTRODUCTION
If the devices that fail at the Electrical Wafer Sorting
(EWS) tests are visualized as black pixels, the spatial
distribution of the failures is likely to show characteristic
patterns. Different shapes are possible: circular spots,
rings, semirings, repetitive chessboard-like patterns, to
mention the most frequent. These patterns can be used to
trace back to the problems that originated the failures
either by analyzing their qualitative features or by
correlating them with the lot history. Hence the interest in algorithms that perform the automatic classification of
large wafer sets on the basis of their EWS maps. Two
approaches are possible: the supervised and the
unsupervised ones. The supervised classifiers require the
preliminary classification of a training set of wafers by a
human operator. This approach is time-consuming and, when the process and/or the product changes, the training has to be performed ex novo. Recently, it has been shown
that wafer classification can be performed by means of
unsupervised methods: these algorithms create clusters of
wafers by identifying their common features and do not
use any training set. In [1], the Kohonen's Self Organizing
Map (SOM) has been used successfully to classify EWS
wafer maps. Several other unsupervised learning methods
may be used to classify wafer maps.
The first aim of this paper is to compare four different
unsupervised methods: SOM, K-means, Neural Gas and
an Expectation Maximization (EM) classifier. The
algorithms are compared using a benchmark based on a
probabilistic model. The performance of the classification
is assessed by means of an index (here denoted as the F-index) that measures the misclassification rate.
Another important issue is finding indexes that measure the classification goodness and are applicable to real data. In fact, with real data the F-index (which assumes the knowledge of the true classification) is no longer usable.
Hence the second aim of this work is to study the
correlation between the F-index and the following indexes [2]: CH-index, D-index, I-index and average likelihood.
In the following section the probabilistic model, the F-index and the benchmark are presented. The third section is devoted to the description of the clustering algorithms. The fourth one illustrates the results of the comparison. The choice of a clustering index applicable to real data is discussed in Section 5. The conclusion summarizes the
main results.
PRELIMINARIES
To make a comparison, a goodness criterion is required. The proposed measure is based on the knowledge of the correct classification. Hence the need for a benchmark made of simulated wafer maps whose true classification is known. In order to obtain the simulated data, a probabilistic model is adopted.
Probabilistic Model
In [1] a probabilistic model was proposed. According to
this model the electrical test of a die has only two results:
good (0) or failed (1). Then a binary Bernoulli variable Xd is used to describe the outcome of the EWS test for the d-th
die. It was assumed that the electrical failure of a single
device occurs independently of the failure of the others
with a probability P(Xd=1)=f(xd,yd) where xd and yd are the
planar coordinates of the die d. A complete wafer Xw can
be created by simulating Nd Bernoulli trials. The origin of
the planar coordinates is placed in the center of the wafer
and the scales of the Cartesian axes are such that the wafer has radius one. This system of coordinates allows one to describe different scales of integration with the same spatial probability.
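The simulation step above can be sketched as follows; the ring-shaped spatial distribution `ring_pattern` is a hypothetical example for illustration, not one of the paper's simulated classes:

```python
import numpy as np

def simulate_wafer(coords, f, rng):
    """Simulate one EWS map: coords is an (Nd, 2) array of die positions
    in the unit-radius wafer frame; f maps (x, y) to the failure
    probability of the die at that position. Returns 1 for failed dies."""
    p = np.array([f(x, y) for x, y in coords])
    return (rng.random(len(coords)) < p).astype(np.uint8)

# Hypothetical spatial distribution: a band of failures near the edge.
def ring_pattern(x, y, p_in=0.05, p_ring=0.8):
    r = np.hypot(x, y)
    return p_ring if 0.8 <= r <= 1.0 else p_in

rng = np.random.default_rng(0)
# Dies on a coarse grid clipped to the unit disk.
grid = np.array([(x, y) for x in np.linspace(-1, 1, 25)
                 for y in np.linspace(-1, 1, 25) if x * x + y * y <= 1])
wafer = simulate_wafer(grid, ring_pattern, rng)
```

Each wafer is thus a binary vector of Nd independent Bernoulli trials, as in the model of [1].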
The benchmark
A set of production lots is often affected by several
failures with different occurrence rates. In this work we simulated 8 classes, each of which is characterized by its own spatial distribution, reported in Figure 1. The spatial distributions reflect some common failure patterns: standard production (Class #2), low-yield wafers (Class #1), repetitive horizontal and vertical (Classes #5 and #6),
spot and ring (Classes #3, #4 and #8), semiring (Class #7). The clustering algorithms were tested on several benchmarks with different numbers of devices (Nd) and wafers (Nw), as shown in Table 1.
Figure 1: Spatial probability of the simulated classes
Benchmark   1     2     3     4     5     6
# Wafers    300   500   800   1200  300   500
# Dies      601   601   601   601   377   377

Benchmark   7     8     9     10    11    12
# Wafers    800   1200  300   500   800   1200
# Dies      601   601   950   950   950   950

Table 1. The twelve data sets of the benchmark
Classification assessment
In a classification two kinds of error can be observed: an identified class includes elements coming from different real classes, or a real class is split into two or more identified classes. The first kind of misclassification can be disruptive for diagnosis purposes. In fact, the reference pattern is explained by the union of different physical problems whose separate detection becomes problematic. Conversely, if a real class is split into two or more classes, each identified class is made of wafers characterized by the same physical problem. This misclassification is much less harmful because in the process diagnosis it is possible to merge together classes whose reference patterns are similar. For this reason a good classification has to produce homogeneous clusters, each of which is made of wafers belonging to the same real class.
A simple way to visualize the number and type of misclassified items is given by the scatter matrix [3]. In Table 2 the scatter matrix of a 4-cluster classification of 100 items belonging to 3 real classes is reported. It can be observed that only clusters #2 and #4 are homogeneous. Clusters #1 and #3 contain respectively 4 and 2 elements that do not belong to the dominant class. These elements are called "misplaced". For each identified cluster j it is possible to obtain the number of misplaced elements as

M_j = Σ_{i=1..Ñc} T(i,j) − max_i T(i,j)
where T(i,j) denotes the entry of the scatter matrix and Ñc is the number of real classes. To measure the degree of homogeneity of the classification we introduce a new index, called the F-index, defined as

F = Ñc / (Nw (Ñc − 1)) · Σ_{j=1..Nc} M_j

It can be shown that the F-index ranges from 0 (optimal clustering) up to 1 (worst clustering).
      N1   N2   N3   N4
Ñ1    16    0   14    0
Ñ2     0   45    0    0
Ñ3     4    0    2   19

Table 2. Scatter matrix of a four-class clustering of 100 data belonging to three real classes
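As an illustration, the F-index can be computed directly from a scatter matrix; the sketch below assumes the normalization Ñc/(Nw(Ñc−1)), chosen so that the worst clustering scores 1, and uses the matrix of Table 2 as input:

```python
import numpy as np

def f_index(T, n_items):
    """F-index from a scatter matrix T of shape (real classes, clusters).
    M_j counts the items of cluster j outside its dominant real class;
    the sum of the M_j is then normalized to the [0, 1] range."""
    T = np.asarray(T)
    nc_real = T.shape[0]                      # number of real classes
    M = T.sum(axis=0) - T.max(axis=0)         # misplaced items per cluster
    return nc_real / (n_items * (nc_real - 1)) * M.sum()

# Scatter matrix of Table 2: 100 items, 3 real classes, 4 clusters.
T = [[16, 0, 14, 0],
     [0, 45, 0, 0],
     [4, 0, 2, 19]]
F = f_index(T, 100)   # only clusters 1 and 3 contribute misplaced items
```

With the 4 + 2 misplaced items of Table 2, this yields a small F close to the optimal value 0.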
CLUSTERING ALGORITHMS
In this work three well-known unsupervised clustering algorithms are compared: SOM, Neural Gas, and K-means. These neural network techniques generate a cluster for each neuron. Each neuron c stores an Nd-dimensional vector pc that represents the centroid of cluster c.
Kmeans
This algorithm is the simplest and fastest one. A single iteration is composed of two steps. In the first one, each data point is associated with the nearest cluster center. In the second step, the centers are updated by computing the barycenter (centroid) of each cluster. The two steps are repeated until the centers no longer change. The parameters of this algorithm are the number of clusters Nc and the initial values of the cluster centers W0. It is important to notice that if two cluster centers have been initialized with the same value, the algorithm can encounter difficulties.
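The two-step iteration can be sketched as follows (a generic K-means, not the authors' implementation; empty clusters simply keep their current center):

```python
import numpy as np

def kmeans(X, centers, max_iter=100):
    """Plain two-step K-means: assign each item to its nearest center,
    then move each center to the centroid of its cluster."""
    centers = centers.copy()
    for _ in range(max_iter):
        # Step 1: nearest-center assignment.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 2: recompute centroids (empty clusters keep their center).
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(len(centers))])
        if np.allclose(new, centers):      # stop when centers no longer move
            break
        centers = new
    return labels, centers

# Two well-separated 2-D blobs as a toy check.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = kmeans(X, X[[0, -1]].astype(float))
```

Initializing the two centers in different blobs, the algorithm separates the two groups in a few iterations.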
SOM
In a SOM network the neurons are organized on an r×c grid. At each iteration t a single wafer Xw (a binary vector) is presented to the network and the winner neuron V is defined as

V = argmin_c ||Xw − pc||

where pc is the characteristic pattern of the c-th class, that is, a real vector whose entries belong to [0,1]. Then all the neurons are updated as follows

pc = pc + hΓ(δ(c,V), t) · (Xw − pc)    (1)
where δ(c,V) is the distance on the grid between neurons c and V, and hΓ(·,·) is a suitable function, monotonically decreasing in both its arguments. An interesting feature of the SOM is that similar reference patterns are placed close to each other on the neuron grid. The function hΓ in (1) depends on a parameter set Γ which includes the following parameters [4]: the initial ηi and final ηf learning rates, the initial σi and final σf effective widths of the topological neighbourhood, and the total number of iterations tf.
In [1] the authors proposed a choice of Γ for the clustering of EWS maps. It was observed that the first four parameters affect the speed of learning [5] while the last one has to do with the overall learning time. The first four parameters were fixed to general-purpose values [4] and a fine tuning of the parameter tf was performed. The proposed set is reported in Table 3.
     ηi   ηf    σi   σf   λi   λf   tf
Γ    .5   .005  3    .1   –    –    80·Nw
Δ    .5   .005  –    –    10   .1   50·Nw

Table 3. Parameter settings for SOM (Γ) and Neural Gas (Δ)
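A single SOM iteration might be sketched as below; the Gaussian neighborhood used here for hΓ is a common choice [5], assumed for illustration rather than taken from the paper:

```python
import numpy as np

def som_step(P, grid_pos, x, eta, sigma):
    """One SOM iteration for prototypes P (Nc x Nd) arranged on a grid.
    grid_pos holds each neuron's (row, col); the Gaussian factor plays
    the role of h_Gamma, shrinking with grid distance to the winner."""
    v = np.linalg.norm(x - P, axis=1).argmin()               # winner neuron V
    delta = np.linalg.norm(grid_pos - grid_pos[v], axis=1)   # grid distance
    h = eta * np.exp(-delta ** 2 / (2 * sigma ** 2))
    return P + h[:, None] * (x - P)                          # update rule (1)

# 2x2 grid of 3-dimensional prototypes, one training vector of ones.
grid = np.array([(r, c) for r in range(2) for c in range(2)], float)
P = np.zeros((4, 3))
P1 = som_step(P, grid, np.ones(3), eta=0.5, sigma=1.0)
```

The winner moves farthest toward the input; grid neighbors move less, which is what keeps similar reference patterns close on the grid.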
Neural Gas
The Neural Gas is quite similar to the SOM algorithm. The main difference is that the neuron topology (e.g. the neuron grid in the SOM) is not fixed. At each iteration the neurons are ordered according to their distance from the proposed input (wafer). The nearest neuron has order 0 and the farthest has order Nc−1. The training update of the neurons is done according to

pc = pc + gΔ(l(c), t) · (Xw − pc)    (2)

where l(c) is the order of neuron c and gΔ(·,·) is a function monotonically decreasing in both its arguments. The function gΔ in (2) depends on a parameter set Δ [4]:

Δ = {ηi, ηf, λi, λf, tf}

where λi and λf affect the amount of information given to the "far" neurons. The considerations made for the SOM method still hold for the Neural Gas. In particular, the same tuning procedure can be adopted: the first four parameters of Δ were fixed to general-purpose values and a fine tuning was made on tf. Several classifications were performed on the benchmarks described in Section 2 and the F-index was used to assess the performance. The optimal parameter values are reported in Table 3.
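A single Neural Gas iteration can be sketched analogously; the exponential decay in the rank used here for gΔ is a common choice [4], assumed for illustration:

```python
import numpy as np

def neural_gas_step(P, x, eta, lam):
    """One Neural Gas iteration: neurons are ranked by distance to the
    input x and updated with a factor decaying exponentially in rank,
    playing the role of g_Delta in update rule (2)."""
    d = np.linalg.norm(x - P, axis=1)
    rank = d.argsort().argsort()       # l(c): 0 = nearest, Nc-1 = farthest
    g = eta * np.exp(-rank / lam)
    return P + g[:, None] * (x - P)

P = np.array([[0.0, 0.0], [0.4, 0.4], [1.0, 1.0]])
P1 = neural_gas_step(P, np.array([1.0, 1.0]), eta=0.5, lam=1.0)
```

Unlike the SOM, the neighborhood is defined by the distance ranking in input space, not by a fixed grid.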
Expectation Maximization (EM)
Given the observations X and a probabilistic model characterized by a parameter vector θ, the likelihood L is given by L = P(X|θ). According to our probabilistic model the parameters are the Nc probability vectors pc and the data are the Nw binary vectors Xw. The maximum likelihood estimate θML is the vector that maximizes the likelihood. The EM is an efficient algorithm to perform such a maximization in classification problems. A notable property of the method is that it guarantees an increase of the likelihood at each iteration.
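A minimal EM sketch for a mixture of multivariate Bernoullis, consistent with the probabilistic model of Section 2; the toy data, the initialization, and the small eps guard against log(0) are illustrative choices, not the authors' settings:

```python
import numpy as np

def em_bernoulli(X, P, w, n_iter=50, eps=1e-9):
    """EM for a mixture of multivariate Bernoullis: X is (Nw, Nd) binary,
    P is (Nc, Nd) success probabilities, w the mixing weights."""
    for _ in range(n_iter):
        # E-step: responsibilities from each wafer's log-likelihood.
        logp = (X @ np.log(P + eps).T
                + (1 - X) @ np.log(1 - P + eps).T + np.log(w))
        logp -= logp.max(axis=1, keepdims=True)   # stabilize before exp
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
        # M-step: weighted relative frequencies update P and w.
        w = R.mean(axis=0)
        P = (R.T @ X) / R.sum(axis=0)[:, None]
    return P, w, R.argmax(axis=1)

# Toy data: two obvious binary patterns, slightly asymmetric init.
X = np.array([[1, 1, 0, 0]] * 10 + [[0, 0, 1, 1]] * 10, float)
P0 = np.array([[0.6, 0.6, 0.4, 0.4], [0.4, 0.4, 0.6, 0.6]])
P, w, labels = em_bernoulli(X, P0, np.array([0.5, 0.5]))
```

Each E/M pair cannot decrease the likelihood, which is the monotonicity property mentioned above.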
             SOM     K-means  Neural Gas  EM
Random       .02816  .04262   .03958      .20953
Centre       .02931  .04036   .04308      .19346
Prototypes   .02871  .07915   .03789      .61586
Data mean    .02929  .08779   .04332      .21315
MinMax       .02943  .08182   .04310      .60589

Table 4. Values of the F-index in 60 experiments using different initialization techniques
COMPARISON
All the algorithms considered have two types of parameters: the number of clusters Nc and the initial values of the reference patterns of the classes. The number of clusters is not a critical choice for this application. In fact, as already observed, in the classification of EWS maps the main target is to find homogeneous clusters. This is possible only if the number of identified classes Nc is greater than the true one Ñc. If we have some a priori knowledge on Ñc, it makes sense to overestimate Nc. In the benchmark, we let Nc = 12.
Conversely, the initial reference pattern may affect the
performance of the classifier. In order to choose the best
methods, different initializations were considered [6]:
Random: the patterns are chosen randomly from the valid set.
Center: this initialization requires that the valid starting points define a limited region; the technique then chooses the center of the validity set as starting point. In order to obtain different values for each neuron, a small random perturbation is added.
Random Prototypes: Nc randomly chosen data points are used as initialization.
Data Mean: the mean of the data set is chosen as starting point. In order to obtain different values for each neuron, a small random perturbation is added.
Min Max: this technique uses an algorithm to select Nc well-distanced data points as starting points.
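The Min Max selection above is not detailed in the paper; a common realization is greedy farthest-point selection, sketched below as an assumption:

```python
import numpy as np

def min_max_init(X, n_clusters, rng):
    """Greedy farthest-point selection: start from a random item, then
    repeatedly pick the item farthest from all centers chosen so far."""
    centers = [X[rng.integers(len(X))]]
    while len(centers) < n_clusters:
        # Distance of every item to its nearest already-chosen center.
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

rng = np.random.default_rng(2)
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
C = min_max_init(X, 2, rng)
```

On this toy set the two selected starting points land in different groups, which is the intent of a well-distanced initialization.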
For each benchmark, 60 different initializations were generated for each technique. For each classification the F-index was evaluated. In Table 4 the average over the 12
benchmarks of the median of the F-index over the 60 experiments is reported. It can be noticed from Table 4 that SOM gives the best performance.
MEASURING CLASSIFICATION PERFORMANCES
ON REAL DATA
The evaluation of the F-index requires the knowledge of the true classification, which is obviously not available in a real problem. To assess the performance of a classification some other clustering index must be used. Generally, these indexes evaluate the within-class and the between-class distances. The first factor is an index of the wafer closeness to the reference pattern, while the second one evaluates the distance between the clusters. The most widespread ones are [2]: the Davies-Bouldin (DB), Dunn (D), Calinski-Harabasz (CH) and I indexes. All of them regard a classification with compact and well-separated clusters as a good one. However, for our specific application the distance between classes is not the main concern, especially if two or more real classes are close to each other. A further index is the likelihood introduced in Section 3. For computational reasons we use the "average device likelihood" (AL) defined as
AL = L^(1/(Nw·Nd))
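The average device likelihood can be computed as a per-device geometric mean in log space for numerical stability; the sketch below assumes each wafer is scored against the reference pattern of its assigned cluster:

```python
import numpy as np

def average_likelihood(X, P, labels, eps=1e-9):
    """Per-device geometric mean of the likelihood: each die's Bernoulli
    probability under its wafer's assigned reference pattern, averaged
    in log space over all Nw * Nd devices."""
    Pc = P[labels]                                 # pattern of each wafer
    ll = X * np.log(Pc + eps) + (1 - X) * np.log(1 - Pc + eps)
    return float(np.exp(ll.mean()))                # = L ** (1 / (Nw * Nd))

# Two wafers perfectly matching their assigned patterns up to p = 0.9.
X = np.array([[1, 0], [0, 1]], float)
P = np.array([[0.9, 0.1], [0.1, 0.9]])
al = average_likelihood(X, P, np.array([0, 1]))
```

Working in log space avoids the underflow that the raw product of thousands of per-die probabilities would cause.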
The goal of our study is to find the index that best approximates the F-index. For this purpose, from each of the 12 benchmarks 30 different clusterings were randomly created. For each classification all the indexes were evaluated. In Figure 2 some significant scatter plots are reported. Then, for each index and benchmark, the correlation coefficient was computed. Table 5 shows the average correlation coefficient with the F-index. It can be noticed that the AL and CH indexes give the best correlation.
DB      CH      D       I       AL
.9885   .9912   .5649   .8939   .9914

Table 5. Correlation coefficients between the F-index and other clustering indexes
CONCLUSION
In this work a comparison was made between some clustering techniques applied to EWS map classification: K-means, SOM, Neural Gas and EM. The last method and the simulated benchmark are based on a simple probabilistic model. In order to evaluate the performance of a classification, a new index, called the F-index, was introduced. This index was created considering its possible use for fault detection in a semiconductor manufacturing environment. The classifiers were tested simulating different numbers of devices and wafers. From this analysis it turns out that the SOM is the most suitable algorithm. The last result presented in this work is the choice of a performance index to be used in the classification of real data. The objective was to find an index that correlates well with the F-index, which is proportional to the number of misclassified wafers but requires the knowledge of the correct classification. The Average Likelihood and the CH index turned out to be the best candidates.
Figure 2. Comparison of indexes measuring the classification goodness. The AL and the CH index correlate well with the F-index, which is proportional to the number of misclassified items
REFERENCES
[1] F. Di Palma, G. De Nicolao, E. Pasquinetti, G. Miraglia, and F. Piccinini, "Unsupervised spatial pattern classification of electrical failures in semiconductor manufacturing," Pattern Recognition Letters, 2005 (to appear).
[2] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 12, pp. 1650–1654, 2002.
[3] A. D. Gordon, Classification. London: Chapman & Hall/CRC, 1999.
[4] B. Fritzke, Some Competitive Learning Methods, 1997, Ruhr-Universität Bochum. ftp://ftp.neuroinformatik.ruhr-uni-bochum.de/pub/software/NN/DemoGNG/sclm.ps.gz
[5] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan College Publishing Company, 1994.
[6] A. Juan, J. García-Hernández, and E. Vidal, "EM initialisation for Bernoulli mixture learning," in SSPR/SPR Workshop, 2004, pp. 635–643.