This document contains a draft version of the following paper:
W. Guo and A. G. Banerjee. Identification of Key Features Using Topological Data Analysis for
Accurate Prediction of Manufacturing System Outputs. Journal of Manufacturing Systems,
43(2), 225-234, 2017.
Readers are encouraged to get the official copy from the publisher or by contacting Dr. Ashis G.
Banerjee at
Identification of Key Features Using Topological Data
Analysis for Accurate Prediction of Manufacturing
System Outputs
Wei Guo^a, Ashis G. Banerjee^{a,b,*}
^a Department of Industrial & Systems Engineering, University of Washington, Seattle, WA 98195, USA
^b Department of Mechanical Engineering, University of Washington, Seattle, WA 98195, USA
Topological data analysis (TDA) has emerged as one of the most promising
approaches to extract insights from high-dimensional data of varying types
such as images, point clouds, and meshes, in an unsupervised manner. To
the best of our knowledge, here, we provide the first successful application of
TDA in the manufacturing systems domain. We apply a widely used TDA
method, known as the Mapper algorithm, on two benchmark data sets for
chemical process yield prediction and semiconductor wafer fault detection,
respectively. The algorithm yields topological networks that capture the
intrinsic clusters and connections among the clusters present in the data
sets, which are difficult to detect using traditional methods. We select key
process variables or features that impact the system outcomes by analyzing
the network shapes. We then use predictive models to evaluate the impact of
the selected features. Results show that the models achieve at least the same
level of high prediction accuracy as with all the process variables, thereby,
providing a way to carry out process monitoring and control in a more cost-
effective manner.
Keywords: Topological data analysis, Feature selection, Yield prediction,
Fault detection
Corresponding author
Email addresses: (Wei Guo), (Ashis G. Banerjee )
Preprint submitted to Journal of Manufacturing Systems March 3, 2017
1. Introduction
Sensors play an essential role in carrying out product feasibility assessment, yield enhancement, and quality control in modern manufacturing systems such as vehicle assembly, microprocessor fabrication, and pharmaceuticals development [1]. A large number of sensors of many different types are typically employed in such systems to measure a variety of process variables ranging from operating conditions and equipment states to material compositions and processing defects over extended time periods. Thus, the volume of acquired data is so vast and heterogeneous that the contribution of individual sensor measurements in predicting the overall system outputs gets obscured. This prediction is made more challenging by the fact that the measurements are often noisy and replete with missing or outlier values. Furthermore, there is significant redundancy among the sensor measurements, leading to the presence of numerous false correlations in the recorded data. It is, therefore, necessary to perform an analysis using statistical methods that are specifically suited to identifying and filtering out existing correlations in erroneous, heterogeneous, and high-dimensional data sets.
Historically, multivariate statistical process control (MSPC) methods, such as principal component analysis (PCA) and partial least-squares (PLS), have served as the dominant mode of addressing this problem [2]. The common idea behind these methods is to define a new set of variables (known as latent variables) through linear combinations of the original variables that describe the sensor measurements. The set of latent variables may be reduced in some cases by applying subsequent dimensionality reduction techniques. However, these methods do not work particularly well when there are a large number of input process variables that share highly non-linear relationships with the system outputs and cannot be modeled using Gaussian distributions. The methods also encounter difficulties in removing the false correlations among the measurements, particularly when the measurements are erroneous. More recently, several non-linear prediction methods have been developed based on response surface fitting as well as kernelized and robust variants of the MSPC techniques [3, 4]. While these methods may achieve high prediction accuracy, they do not provide any direct way of quantifying the contribution or impact of the individual process variables.
Here, we present an alternative method that leverages the emerging topic area of topological data analysis (TDA) [5] to select the important variables that are subsequently used in both linear and non-linear prediction models. More specifically, we employ a well-established TDA method known as the Mapper algorithm developed by Singh et al. [6]. It is based on the core idea of understanding the unknown topology of the high-dimensional manifold in which the data resides to extract hidden patterns. In particular, it clusters all the level sets of the data (defined using a projection of the high dimensional data to a lower dimensional space) to generate a topological network that represents the inherent clusters and connections among the clusters in the actual data.
The Mapper algorithm has already enjoyed immense popularity in fields such as bioinformatics and machine vision. For example, it has been used to reveal unique and subtle aspects of the folding patterns of RNA [7] and to unlock previously unidentified relationships in immune cell reactivity between patients with type-1 and type-2 diabetes [8]. Another influential example occurs in personalized breast cancer diagnosis, in which a novel subgroup of tumors with a unique mutational profile and 100% survival rate has been discovered [9]. Additionally, its deformation invariant property has been used to detect 3D objects from point cloud data with intrinsically different shapes [6].
Despite the potential of TDA in general and the Mapper algorithm in particular, there has been no prior application in the manufacturing domain to the best of our knowledge. Inspired by the success in biomedical and vision problems, we employ the Mapper algorithm and show that it facilitates the analysis of the impact of each process variable on system outputs through direct visualization. It also determines whether particular subgroups of the data are selectively responsive to different process variables, which helps to monitor and diagnose processes effectively.
We first apply the Mapper algorithm on a benchmark chemical processing data set to predict product yield [10]. Specifically, the shape of the generated topological network is used to select key features that explain the observed differences in the process measurements in a statistically significant manner. Second, we investigate the role of individual process variables in causing wafer failures in another publicly available semiconductor manufacturing data set. Although it has been recognized that k-nearest neighbor methods can identify faulty wafers effectively [11, 12, 13, 14], the actual process variables that result in the wafer anomalies have never been identified. To this end, we demonstrate how the Mapper algorithm rapidly traces the causality hidden in this high-dimensional data set.
The rest of the paper is organized as follows. Section 2 gives an overview of the general characteristics of manufacturing data and the types of predictor (feature) and response variables that are of interest to us. In Section 3, we review the Mapper algorithm and its application in feature selection. We demonstrate the applicability of the Mapper algorithm for feature selection on two benchmark manufacturing data sets in Section 4. The effectiveness of the selected features is further assessed through predictive models. We conclude the paper with remarks and future research topics.
2. Problem formulation
In real-world manufacturing systems, data is collected using a large number of sensors that are affixed to or embedded within different machines and equipment, resulting in a high-dimensional body of heterogeneous data. The data is usually in the form of time series measurements of different process variables such as temperature, pressure, density, humidity, voltage, chemical or material composition including the relative proportions of various constituents of alloys or mixtures, material removal or deposition rate, number and severity of processed part flaws and defects, and so on. The sensors, thus, come in myriad forms ranging from thermocouples, pressure gauges, hydrometers, hygrometers, and voltmeters to optical cameras, spectrometers, laser scanners, and ultrasonic transducers.
Consequently, manufacturing sensor data is prone to noise terms, missing values, and outliers. These measurement errors depend on the sensitivity of the sensors to the operating conditions based on their underlying physical principles of actions. For example, it is not at all uncommon for temporary sensor hardware malfunction to result in missing values. A further problem is that of co-linearity, which is usually caused by partial redundancy in the sensor arrangement such as the placement of multiple sensors in close proximity to one another. The net result of these complications is that manufacturing systems are often “data-rich but information-poor”.
Consequently, there is a strong need to effectively select a minimal number of process variables that primarily affect the output variables of interest such as product quality and yield of a manufacturing system comprising several processes of varying types. As discussed earlier in Section 1, this form of selection facilitates process monitoring and diagnostics through targeted sensor data acquisition, storage, and processing. Even if it is cheap or convenient to manage data from all the sensors, knowing which measurements of what variables matter the most makes it feasible to rapidly regulate out-of-control processes or adapt them to manufacture high quality products at desired rates.
To formulate the problem mathematically, we suppose there are $m$ process variables (features) and $N$ sensor measurements recorded at different time instants. Each measurement is, thus, represented by an $m$-dimensional vector $\mathbf{x}_i \in \mathbb{R}^m$, $i = 1, 2, \ldots, N$. The data is then assembled into a matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N]^T \in \mathbb{R}^{N \times m}$. Each column denotes a process variable, which is measured by one sensor operating alone or by the concurrent operation of several sensors that function in unison. The latter case is known as data fusion [15], which provides a wide range of sensed parameters, and is, hence, more reliable for data analysis.
In a batch process with batch length $L$, a 3-D data array $\bar{\mathbf{X}} \in \mathbb{R}^{N \times m \times L}$ is often unfolded batch-wise into a 2-D matrix $\mathbf{X} \in \mathbb{R}^{N \times mL}$. In this case, each measurement $\mathbf{x}_i \in \mathbb{R}^{mL}$ is a batch and each process variable is measured $L$ times throughout the batch, hence, corresponding to $L$ columns. For each row, the measurement is either spatially-sampled or temporally-sampled. For instance, in the semiconductor manufacturing environment, electronic wafer map data collected from in-line measurements are sampled spatially across the surface of the wafer for defect inspection [16]. Usually, there will also be one or more response variables to reflect the output quality or quantity. We write the output with $r$ response variables into a matrix $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N]^T \in \mathbb{R}^{N \times r}$, where each response variable is represented by one column. Response variables are commonly seen as continuous variables denoting production yields or binary variables indicating pass or fail.
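The batch-wise unfolding described above can be illustrated with a small array. This is a minimal sketch, assuming the 3-D array is stored with axes (batch, variable, time); the sizes and the names `X_bar`, `X`, and `block` are made up for illustration.

```python
import numpy as np

# Hypothetical sizes: N batches, m process variables, L samples per batch.
N, m, L = 4, 3, 5
rng = np.random.default_rng(0)
X_bar = rng.normal(size=(N, m, L))   # 3-D array: batch x variable x time

# Batch-wise unfolding: each batch becomes one row of length m*L, holding the
# L samples of variable 0 first, then the L samples of variable 1, and so on.
X = X_bar.reshape(N, m * L)

# Columns j*L .. (j+1)*L - 1 hold the L measurements of process variable j.
j = 1
block = X[:, j * L:(j + 1) * L]
```

Under this layout, every process variable occupies a contiguous block of $L$ columns, which matches the statement that each variable corresponds to $L$ columns of the unfolded matrix.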
3. Technical approach
We now present the framework of the Mapper algorithm and outline the typical pipeline of feature selection using the Mapper algorithm. For more details about the Mapper algorithm and concrete examples of real applications, we refer the reader to [6, 17].
3.1. Mapper algorithm
The Mapper algorithm can be considered as a partial clustering algorithm inspired by the classical discrete Morse theory [6]. In topology, discrete Morse theory enables one to characterize the topology of high dimensional data via some functional level sets [18]. More specifically, given a topological space $X$, when $h: X \to \mathbb{R}$ is a smooth real-valued function (Morse function), topological information of $X$ is inferred from the level sets $h^{-1}(c)$ for some real $c$.
The Mapper algorithm extends this inference to incorporate standard clustering methods for the analysis of high dimensional data sets. Given a data matrix $\mathbf{X}$, the setup of the Mapper algorithm includes:
1. Set resolution parameters: a number of intervals $l$ and overlap percentage $p$, where $p \in (0, 100)$.
2. Compute the pairwise distance matrix $D = [d(\mathbf{x}_i, \mathbf{x}_j)] \in \mathbb{R}^{N \times N}$ based on the distance metric chosen.
3. Select a filter function $f: \mathbf{X} \to \mathbb{R}^n$ to stratify the data.
The most crucial step in the Mapper algorithm is the selection of the filter function to “guide” a clustering algorithm on the high-dimensional data. A few common filter functions include Gaussian kernel density estimator, eccentricity filter, principal metric SVD filter, and eigenvectors of graph Laplacians. Moreover, we can take the projection found by dimensionality reduction/manifold learning techniques that maps the high-dimensional data to a low-dimensional space as the filter function. For example, in the chemical manufacturing process study, our choice of the filter function is the 2-D projection found by the multidimensional scaling (MDS) method. MDS in this case attempts to embed the data such that the pairwise distances in the high-dimensional space are preserved in the 2-D Euclidean space. Accordingly, the 2-D embedding coordinates denoted by $\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \ldots, \hat{\mathbf{x}}_N$, are the minimizers of a loss function, $\sigma$, defined as
$$\sigma(\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \ldots, \hat{\mathbf{x}}_N) = \sum_{i<j} \left( d(\mathbf{x}_i, \mathbf{x}_j) - \|\hat{\mathbf{x}}_i - \hat{\mathbf{x}}_j\| \right)^2. \quad (1)$$
Therefore, the filter function is specified as
$$f: \mathbf{X} \to f_1 \times f_2, \quad (2)$$
where $f_1$ and $f_2$ are coordinates of $\hat{\mathbf{x}}_1, \hat{\mathbf{x}}_2, \ldots, \hat{\mathbf{x}}_N$ along the 1st and 2nd dimension, respectively. For the study of fault detection in the semiconductor manufacturing processes, we employ the 2-D projection found by the t-distributed stochastic neighbor embedding (t-SNE) algorithm as the filter function [19]. t-SNE aims to preserve the joint probabilities $p_{ij}$ that measure similarities between $\mathbf{x}_i$ and $\mathbf{x}_j$, $i, j = 1, 2, \ldots, N$, as much as possible in the 2-D space. Specifically, $p_{ij}$ is defined as
$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \quad (3)$$
where the conditional probability $p_{j|i}$ that represents the similarity of $\mathbf{x}_j$ to $\mathbf{x}_i$ is given by
$$p_{j|i} = \frac{\exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2\right)}. \quad (4)$$
Herein the variance $\sigma_i^2$ of the Gaussian centered at $\mathbf{x}_i$ is determined by a pre-defined perplexity. On the other hand, the joint probability $q_{ij}$ that reflects the similarity between 2-D embedding coordinates $\tilde{\mathbf{x}}_i$ and $\tilde{\mathbf{x}}_j$ is defined based on a heavy-tailed Student’s t-distribution with one degree of freedom:
$$q_{ij} = \frac{\left(1 + \|\tilde{\mathbf{x}}_i - \tilde{\mathbf{x}}_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|\tilde{\mathbf{x}}_k - \tilde{\mathbf{x}}_l\|^2\right)^{-1}}, \quad (5)$$
such that dissimilar measurements in the $m$-D space are mapped far apart in the 2-D space. $\tilde{\mathbf{x}}_i$, $i = 1, 2, \ldots, N$, are then determined by minimizing the Kullback-Leibler divergence between the joint probability distribution $P$ in the $m$-D space and the joint probability distribution $Q$ in the 2-D space,
$$D_{KL}(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}. \quad (6)$$
Likewise, the filter function in this case is given by
$$f: \mathbf{X} \to g_1 \times g_2, \quad (7)$$
where $g_1$ and $g_2$ are coordinates of $\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_N$ along the 1st and 2nd dimension, respectively. In addition, it should be noted that when $N$ is too large, numerical optimization techniques are used.
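The two embedding objectives can be evaluated numerically for a candidate 2-D embedding. The following is a minimal sketch, assuming the raw-stress form of the MDS loss (sum over pairs of squared differences between original and embedded distances) and the t-SNE quantities of Eqs. (5) and (6). The function names and the toy coordinates are illustrative and are not taken from the paper's code.

```python
import math

def mds_stress(D, Y):
    """Raw stress: sum over i < j of (D[i][j] - ||Y[i] - Y[j]||)^2."""
    n = len(Y)
    return sum((D[i][j] - math.dist(Y[i], Y[j])) ** 2
               for i in range(n) for j in range(i + 1, n))

def tsne_q(Y):
    """Student-t (one degree of freedom) joint similarities q_ij, with q_ii = 0."""
    n = len(Y)
    w = [[0.0 if i == j else 1.0 / (1.0 + math.dist(Y[i], Y[j]) ** 2)
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, w))
    return [[w_ij / total for w_ij in row] for row in w]

def kl_divergence(P, Q):
    """D_KL(P || Q) = sum over i != j of p_ij * log(p_ij / q_ij)."""
    n = len(P)
    return sum(P[i][j] * math.log(P[i][j] / Q[i][j])
               for i in range(n) for j in range(n)
               if i != j and P[i][j] > 0.0)

# Toy 2-D embedding of three points (made-up coordinates).
Y = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
# Target distances that this embedding reproduces exactly.
D = [[math.dist(a, b) for b in Y] for a in Y]

stress = mds_stress(D, Y)      # zero, since the embedding matches D exactly
Q = tsne_q(Y)
kl_same = kl_divergence(Q, Q)  # zero, since the two distributions coincide
```

In practice an optimizer moves the embedding coordinates to reduce these values; the sketch only evaluates the objectives and does not perform the minimization.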
The algorithm is summarized as a flow chart in Fig. 1. After setup, the first step is to divide the filter range and cover it with overlapped intervals so that the clustering algorithm in the ensuing step focuses on the local information of the data that is likely to be ignored by the clustering over the entire data. The second step is to cluster the data in the original high dimensional space for every level set (subset). The Mapper algorithm is not tied to any particular clustering algorithm. However, it is always required to estimate certain parameters (thresholds) in order to determine the number of clusters in every level set. The last step of the algorithm is to link any two clusters from neighboring level sets together if they have one or more common data points.
Figure 1: Framework of the Mapper algorithm for generating topological networks.
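The three steps (cover the filter range, cluster each level set, link clusters that share points) can be sketched for a 1-D filter as follows. This is an illustrative sketch rather than the implementation used in the paper. It assumes equal-length intervals, and it swaps in a simple single-linkage rule (connected components under a distance threshold `eps`) for the clustering step, which the algorithm leaves open. All function names are hypothetical.

```python
import math
from itertools import combinations

def interval_cover(lo, hi, n_intervals, overlap):
    """Equal-length intervals covering [lo, hi]; consecutive intervals share a
    fraction `overlap` (0 <= overlap < 1) of their length."""
    width = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = width * (1.0 - overlap)
    return [(lo + k * step, lo + k * step + width) for k in range(n_intervals)]

def single_linkage(points, idx, eps):
    """Connected components of the eps-neighbourhood graph on points[idx].
    Assumption: a stand-in for the clustering step, not DBSCAN itself."""
    unvisited, clusters = set(idx), []
    while unvisited:
        stack = [unvisited.pop()]
        comp = set(stack)
        while stack:
            i = stack.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            comp |= near
            stack.extend(near)
        clusters.append(frozenset(comp))
    return clusters

def mapper_1d(points, filter_values, n_intervals, overlap, eps):
    """Return (nodes, edges): nodes are sets of point indices, edges join
    nodes that have at least one point in common."""
    lo, hi = min(filter_values), max(filter_values)
    tol = 1e-9 * max(hi - lo, 1.0)   # guard against round-off at the ends
    nodes = []
    for a, b in interval_cover(lo, hi, n_intervals, overlap):
        level_set = [i for i, f in enumerate(filter_values)
                     if a - tol <= f <= b + tol]
        nodes.extend(single_linkage(points, level_set, eps))
    # Clusters from the same level set are disjoint, so every edge found here
    # joins clusters from different, overlapping level sets.
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges

# Toy data: 40 points on the unit circle, filtered by their x-coordinate.
n_pts = 40
points = [(math.cos(2 * math.pi * k / n_pts), math.sin(2 * math.pi * k / n_pts))
          for k in range(n_pts)]
filter_values = [p[0] for p in points]
nodes, edges = mapper_1d(points, filter_values, n_intervals=4, overlap=0.3, eps=0.2)
```

On this toy circle the output has six nodes joined in a single cycle, which is the kind of "loop" shape the network is meant to expose.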
In the 1-D Mapper case, the output is a 1-D simplicial complex that comprises only vertices (0-simplex) and edges (1-simplex). More generally, if the target space is $\mathbb{R}^n$, higher simplices may appear in the output simplicial complex, such as triangular faces (2-simplex) whenever three clusters from neighboring level sets have nonempty intersections. The compressed representation of the simplicial complex allows us to obtain a qualitative understanding of how the data are organized on a large scale through direct visualization. Additionally, the resolution of the complex changes from coarse to fine as the number of intervals $l$ increases. This change of resolution reflects the change in topology of the data set.
It is worth mentioning that the filter range is not necessarily covered by $l$ overlapped intervals of equal length. In fact, the Mapper algorithm is highly parallelizable. To improve the efficiency of parallel computation, it is more convenient to decompose the filter range into $l$ overlapped intervals wherein each interval contains the same number of points so that the running times would be similar for all the level sets.
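An equal-count cover can be built by windowing the points in sorted filter order instead of splitting the filter range by length. The sketch below is one possible construction under that assumption; `equal_count_cover` is a hypothetical helper and the filter values are made up.

```python
def equal_count_cover(filter_values, n_intervals, overlap):
    """Overlapping windows over the points sorted by filter value, each
    holding (nearly) the same number of points."""
    order = sorted(range(len(filter_values)), key=filter_values.__getitem__)
    n = len(order)
    size = n / (n_intervals - (n_intervals - 1) * overlap)  # points per window
    step = size * (1.0 - overlap)
    windows = []
    for k in range(n_intervals):
        start = int(round(k * step))
        stop = min(n, int(round(k * step + size)))
        windows.append([order[i] for i in range(start, stop)])
    return windows

# Skewed toy filter values: equal-length intervals would be very unbalanced.
values = [((7 * i) % 20) ** 2 for i in range(20)]
windows = equal_count_cover(values, n_intervals=4, overlap=0.5)
sizes = [len(w) for w in windows]
```

Here each of the four windows holds eight points and consecutive windows share four, so the per-level-set clustering workload is balanced even though the filter values are heavily skewed.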
3.2. Application of Mapper algorithm to feature selection
The output graph of the Mapper algorithm contains the information of clusters in the data at the local level, as well as their positions relative to one another and to the remainder of the data set. Therefore, the principle of applying the Mapper algorithm to feature selection is to recognize shapes in the resulting graph that encode the essential structural information of the data. Typical shapes of interest found in a graph are subgroups of clusters that display distinct patterns such as “loops” (continuous circular segments) and “flares” (long linear segments), as opposed to portions of the graph within which the local environment of each cluster is roughly identical.
Aside from shapes of interest, we also discern the trends in terms of the output values associated with each cluster in the graph rendered by the Mapper algorithm, such as which clusters contain several measurements from faulty samples in the case of anomaly detection. Furthermore, we are able to distinguish the fundamental subgroups from artifacts by observing whether the shapes of the given subgroups remain consistent when the resolution parameters are varied over a wide range of values. After the fundamental subgroups of interest are detected, standard statistical tests, such as the Kolmogorov-Smirnov test and Student’s t-test, are performed to identify the features that best distinguish the subgroups from one another. The final set of features thus selected is then fed into classification or regression models to perform a desired prediction task.
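The statistical-test step can be sketched as a ranking of features by the two-sample Kolmogorov-Smirnov statistic between two subgroups. This is a minimal pure-Python sketch with made-up data. The helper names are hypothetical, and only the statistic is computed; for p-values, `scipy.stats.ks_2samp` returns both the statistic and a p-value.

```python
def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past every value <= x, then compare the CDFs.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def rank_features(group1, group2, names):
    """KS statistic per feature; rows are measurements, columns are features."""
    scores = {name: ks_statistic([row[c] for row in group1],
                                 [row[c] for row in group2])
              for c, name in enumerate(names)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up subgroups: feature "f1" separates them fully, "f2" barely does.
group1 = [[1.0, 5.0], [2.0, 6.0], [3.0, 7.0]]
group2 = [[10.0, 5.5], [11.0, 6.5], [12.0, 4.5]]
ranked = rank_features(group1, group2, ["f1", "f2"])
```

A statistic of 1 means the two subgroups do not overlap at all on that feature, while values near 0 mean the feature is distributed similarly in both.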
Thus, we end up addressing two main challenges in applying the Mapper algorithm to identify key features from manufacturing data. The first one pertains to a suitable selection of the filter function so as to map the high-dimensional data to a low-dimensional space where the data can be conveniently stratified. Unlike in the case of point clouds, meshes, or images, there is no well-established function, and we select the MDS projection method based on final output prediction quality. The second challenge is on varying the resolution parameters appropriately so that the fundamental subgroups are correctly distinguished from artifacts in the generated topological networks. Choice of a coarse granularity of variation leads to the appearance and disappearance of subgroups, whereas the use of very fine granularity makes the process time-consuming. We vary the parameters in a simple way such that a majority of the subgroups, which are identified at a particular resolution, remain intact as the parameters change (the other subgroups appear and disappear enabling us to characterize them as artifacts).
4. Results
In this section, we conduct two studies to show how to achieve feature selection using the Mapper algorithm. With selected features, the first study obtains accurate predictions of productivity for a chemical processing benchmark, and the second study reaches a high accuracy in fault classification for a semiconductor etch process.
4.1. Prediction of manufacturing productivity
The data is for a chemical process plant that is described in [20] and can be obtained from the R package “AppliedPredictiveModeling”. The data set contains 176 measurements of biological materials for which 57 variables are measured, where there are 12 biological starting materials and 45 manufacturing process parameters (predictors). The starting material is generated from a biological unit and has a wide range of quality and characteristics. The manufacturing process parameters include temperature, drying time, washing time, and concentrations of by-products at various steps. The biological variables are used to gauge the quality of the raw material before processing but cannot be changed, whereas the manufacturing process parameters can be changed during operations. The measurements are not independent since some of the measurements are produced from the same batch of biological starting materials. We aim to investigate the relationships between the predictors and the final pharmaceutical product yield, and develop a model to estimate the percentage yield of the manufacturing process.
4.1.1. Data preprocessing
As we want to maximize the level of automation in predicting manufacturing productivity for industrial applications, the data is preprocessed with a minimum amount of work. First, the outliers in the data set are marked as missing values and the features with near-zero variances are discarded. During this step, BiologicalMaterial07 is removed. Second, we apply Box-Cox transformation to the data to eliminate distributional skewness, and scale each column of the data to zero mean and unit variance. The last step is to impute the missing values by the k-NN method with $k = 5$. Note that all of these steps can be handled automatically in the production environment.
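Two of the preprocessing steps, column scaling and k-NN imputation, can be sketched with NumPy as below. This is a simplified illustration under stated assumptions and is not the paper's preprocessing code. The Box-Cox transformation and the near-zero-variance filter are omitted, neighbour distances use only the columns observed in both rows, and the helper names and toy matrix are made up.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance, ignoring NaNs."""
    return (X - np.nanmean(X, axis=0)) / np.nanstd(X, axis=0)

def knn_impute(X, k=5):
    """Fill each NaN with the mean of that column over the k nearest rows.
    Assumption: distance is the RMS difference over co-observed columns."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    missing = np.isnan(X)
    for i in np.flatnonzero(missing.any(axis=1)):
        shared = ~missing[i] & ~missing          # columns observed in both rows
        diff = np.where(shared, X - X[i], 0.0)
        n_shared = shared.sum(axis=1)
        dist = np.full(len(X), np.inf)
        ok = n_shared > 0
        dist[ok] = np.sqrt((diff[ok] ** 2).sum(axis=1) / n_shared[ok])
        dist[i] = np.inf                         # a row is not its own neighbour
        order = np.argsort(dist)
        for c in np.flatnonzero(missing[i]):
            donors = [j for j in order
                      if not missing[j, c] and np.isfinite(dist[j])][:k]
            if donors:
                filled[i, c] = X[donors, c].mean()
    return filled

# Toy matrix with one missing entry (made-up values).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, np.nan],
              [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
Z = standardize(X)              # scaled data, NaN still present
filled = knn_impute(Z, k=2)     # NaN replaced using the two nearest rows
```

In the toy matrix the missing entry in row 2 is filled with the average of rows 1 and 3, its two nearest neighbours in the first column.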
4.1.2. Feature selection
To begin with, we choose Euclidean distance as the metric to represent the similarity between the measurements. In this work, much effort is spent on the suitable selection of the filter function due to the complex underlying structure of the data. Some commonly considered filter functions include the eccentricity function and linear and nonlinear projections such as PCA and Isomap. Regarding the quantity of interest and the purpose of the filter function, we use the response variable to “supervise” the stratification of the data. The output of the MDS method that reduces the data set to 2 dimensions is shown to provide the smoothest variations of the response values over the embedding coordinates, and is eventually chosen as the filter function.
In the next stage, each dimension is covered by 14 intervals of equal length with 80% overlap between any two successive intervals, leading to the filter range being divided into 196 (14 × 14) level sets in all. The density-based spatial clustering of applications with noise (DBSCAN) method is subsequently employed for clustering in each level set, where the number of clusters is determined by the minimum number of measurements in a cluster and the maximum distance between two measurements in the same cluster [21].
We implement the steps above in Python^1 and obtain a topological network in the form of a simplicial complex as shown in Fig. 2. Each cluster is represented by a node, and the node size is proportional to the number of measurements in the cluster based on a logarithmic scale. An edge is generated between any two nodes from neighboring level sets that have at least one measurement in common. We normalize the value of the product yield within the range 0-1, and color each node based on the average normalized yield value for the measurements in the node. As is seen in Fig. 2, the shape of the data is captured by the generated topological network after iterating multiple times at various resolution scales. The resolution is set at a large number of intervals and a high overlap percentage. A large number of intervals helps to uncover subtle aspects of the shape of the data rather than a blob, and a high overlap percentage seeks to have all nodes connected as a single network if possible. Thus, we are able to identify the subgroups of interest and acquire overall structural information on how the data is distributed within the network. In this problem, we are interested in the difference in patterns between the measurements with high and low yields. Notice that the high yields are separated into two subgroups, and the low yields are also bifurcated into two subgroups with different patterns. Therefore, two subgroups of measurements with high yield and two subgroups of measurements with low yield are extracted from the data as encircled in Fig. 2.
^1 Code adapted from
Figure 2: Topological network derived from the chemical processing data at a specified
resolution. Each node is colored based on the average normalized yield value for the
measurements in the node, where the normalized yield varies between 0 and 1. High and
low yield subgroups are isolated from the rest of the network, where A and C are extracted
as outer flares and B and D are extracted from the periphery of the network as suggested
in [17].
A two-sample Kolmogorov-Smirnov (KS) test, which is sensitive to the difference in both location and shape of the empirical cumulative distributions of two groups, is then performed between subgroups A and C, A and D, B and C, and B and D over all the columns in the data matrix to identify the features that best discriminate between them. We record the largest KS-score and the associated p-value as well as the adjusted p-value among the four tests for each feature. The results are presented in Table 1 and further visualized in Fig. 3. The p-values are adjusted using the well-established Benjamini-Hochberg (B-H) procedure [22, 23] that is commonly used to control the false discovery rate (FDR) when multiple features or variables are evaluated for statistical significance. The B-H adjustment provides greater flexibility at the expense of somewhat higher FDR as compared to the traditional Bonferroni correction method. This adjustment is, thus, better suited for our purpose as we want to identify all the process variables that may have an impact on the manufacturing system outputs. The most salient features are selected based on high KS-scores (>0.9) and low p-values (<0.05), where 11 of them are the measurements of manufacturing processes that can be controlled. Thus, the product yield should be improved by altering these steps in the process to have higher or lower values. We also note that the selection of the most salient features is not affected by the B-H procedure in this case.

Table 1: Kolmogorov-Smirnov test to identify features that best differentiate between the subgroups (a, b)

Feature   KS-score  p-value   Adj. p-value   Feature   KS-score  p-value   Adj. p-value
B01       0.882     5.53e-7   2.21e-6        M18       0.882     1.93e-7   7.20e-7
B02*      1         7.57e-8   1.06e-6        M19*      1         1.95e-9   2.18e-8
B03*      1         7.57e-8   1.06e-6        M20       0.778     1.12e-4   3.49e-4
B04*      0.917     1.16e-6   9.28e-6        M21       0.598     0.002     0.004
B05       0.739     2.36e-5   6.01e-5        M22       0.203     0.821     0.901
B06*      1         7.57e-8   1.06e-6        M23       0.369     0.142     0.204
B08*      1         7.55e-9   8.46e-8        M24       0.539     0.007     0.012
B09       0.417     0.054     0.082          M25       0.787     5.22e-6   1.39e-5
B10       0.728     3.32e-5   8.09e-5        M26*      0.941     2.08e-8   1.17e-7
B11       0.886     4.95e-7   2.22e-6        M27       0.717     4.64e-5   1.04e-4
B12*      0.952     1.34e-8   9.39e-8        M28*      1         1.95e-9   2.18e-8
M01       0.533     0.008     0.013          M29*      1         1.95e-9   2.18e-8
M02*      1         7.55e-9   8.46e-8        M30       0.768     2.15e-5   6.35e-5
M03       0.650     0.001     1.23e-3        M31*      0.944     6.14e-8   3.82e-7
M04*      1         1.88e-7   1.75e-6        M32*      0.941     2.08e-8   1.17e-7
M05       0.647     5.95e-4   1.28e-3        M33       0.894     1.28e-7   5.50e-7
M06       0.722     4.32e-4   1.15e-3        M34       0.238     0.718     0.855
M07       0.261     0.521     0.635          M35       0.501     0.011     0.017
M08       0.314     0.259     0.345          M36       0.787     5.22e-6   1.39e-5
M09*      0.944     1.08e-6   6.05e-6        M37       0.317     0.284     0.379
M10       0.833     2.64e-5   1.06e-4        M38       0.381     0.167     0.253
M11       0.886     4.95e-7   2.22e-6        M39       0.294     0.371     0.472
M12       0.667     0.001     0.003          M40       0.278     0.560     0.713
M13*      1         1.88e-7   1.75e-6        M41       0.262     0.601     0.783
M14       0.692     9.71e-5   2.01e-4        M42       0.488     0.034     0.064
M15*      0.905     8.40e-8   4.28e-7        M43       0.846     7.12e-7   2.21e-6
M16       0.690     0.001     0.002          M44       0.291     0.342     0.426
M17       0.833     2.64e-5   1.06e-4        M45       0.222     0.819     0.936

(a) B: BiologicalMaterial; M: ManufacturingProcess
(b) Key features characterized with high KS-score (>0.9) and low adjusted p-value (<0.05) are marked with an asterisk (*).

Figure 3: Key features (marked by x-axis tick labels) that best differentiate between the subgroups are identified by Kolmogorov-Smirnov tests as those which yield a high KS-score (>0.9) and a low corresponding adjusted p-value (<0.05). (x-axis: process variable; legend: without adjustment, with adjustment.)
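The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. This is a generic textbook form of the procedure with made-up p-values. It is not meant to reproduce the entries of Table 1, which also involve taking the largest KS-score over four tests, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return B-H adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for four features.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
selected = [i for i, p in enumerate(adj) if p < 0.05]
```

For comparison, a Bonferroni correction would multiply every raw p-value by the number of tests (here 4), which is stricter and would reject fewer features at the same threshold.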
Fig. 4 examines the effects of the features on the product yield and probes
the relationships between them. We color the same network nodes based on
normalized feature values. The color of each node encodes the normalized
feature value averaged across all the measurements in the node, with blue
denoting a low value and red indicating a high value. We see that significant
differences between the subgroups exist for both BiologicalMaterial06 and
ManufacturingProcess13, both of which are selected in Table 1. In contrast
to Fig. 4(a) and (b), the unselected feature ManufacturingProcess22 shows no
significant difference between any of the subgroups in Fig. 4(c). Meanwhile,
on comparing with Fig. 2, BiologicalMaterial06 shows a positive relationship
with the yield, whereas ManufacturingProcess13 displays a negative relationship
with the yield.
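The node coloring described above can be sketched as follows: the feature is min-max
normalized to [0, 1] and then averaged over the measurements in each node. The node names
and memberships are hypothetical; in a Mapper network a measurement may belong to more than
one node, which the toy example reflects.

```python
def normalize(values):
    """Min-max scale a list of feature values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def node_colors(feature, nodes):
    """Average the normalized feature value over the members of each node."""
    scaled = normalize(feature)
    return {
        name: sum(scaled[i] for i in members) / len(members)
        for name, members in nodes.items()
    }


# Hypothetical feature values for four measurements and three overlapping nodes.
feature = [2.0, 4.0, 6.0, 10.0]
nodes = {"n0": [0, 1], "n1": [1, 2], "n2": [3]}
colors = node_colors(feature, nodes)
print(colors)  # low values map to the blue end, high values to the red end
```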
4.1.3. Predictive modeling
[Figure 4 panels: (a) BiologicalMaterial06: significant differences between subgroups A/B
and C/D, positive relationship with yield; (b) ManufacturingProcess13: significant
difference between subgroups B and D, negative relationship with yield;
(c) ManufacturingProcess22: no significant difference between subgroups.]
Figure 4: Topological networks colored based on different selections of features at the same
resolution as in Fig. 2. For every network, each node is colored based on the average
normalized feature value of all the measurements included in the node, where the normalized
feature value varies between 0 and 1.
Four regression models, PLS, random forest (RF), Cubist and Gaussian
process with a Gaussian kernel (kGP), are chosen to predict the yield of
the chemical manufacturing process. These models represent a linear model,
a tree-based model, a rule-based model and a kernelized technique, respectively.
We randomly split the entire data set into a training set and a testing
set in a 7:3 ratio. The parameters of each model are tuned
using 25 iterations of 10-fold cross-validated search over the parameter set.
The trained models are then used to predict the percentage yield for the
testing set.
Table 2: Estimation errors and computation times for different models with all features
and selected features
Features   Method   RMSE (training)   RMSE (testing)   Time (s, training)   Time (s, testing)
All        PLS      1.20±0.05         1.29±0.10        1.33±0.30            0.005±0.001
All        RF       1.13±0.06         1.15±0.15        130±2.40             0.006±0.001
All        Cubist   1.00±0.07         1.15±0.13        58.5±4.11            0.025±0.004
All        kGP      1.21±0.04         1.25±0.11        8.14±0.43            0.002±9.4e-4
Selected   PLS      1.13±0.05         1.25±0.09        1.02±0.16            0.002±3.8e-4
Selected   RF       1.11±0.06         1.13±0.15        91.2±2.53            0.005±8.9e-4
Selected   Cubist   1.05±0.10         1.20±0.08        24.7±1.54            0.008±0.002
Selected   kGP      1.19±0.05         1.22±0.11        6.24±0.33            0.001±6.3e-4
Table 2 compares the prediction results and computation times between
all the features and just the selected features for the models based on 30
runs. The prediction accuracy is evaluated by the root mean squared error
(RMSE), and computation times are measured on a laptop with an Intel Core
i5 (1.7 GHz) CPU and 4 GB RAM. We find that the models with selected
features achieve performance comparable to that of the models with all the
features. In particular, for the PLS, RF and kGP models, the selected features
outperform all the features in terms of both training and testing errors, which
highlights the efficacy of the features selected using the Mapper algorithm.
Meanwhile, the training times are reduced by about 30%–60% for the RF
and Cubist models using the selected features.
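The quoted range can be checked with a short calculation on the mean training times in
Table 2. This sketch assumes that the first four rows of the table are the all-feature
models and the last four are the selected-feature models, a reading consistent with the
error comparison in the text; the variable names are illustrative.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Mean training times in seconds: (all features, selected features), from Table 2.
training_time_s = {"RF": (130.0, 91.2), "Cubist": (58.5, 24.7)}

reductions = {
    model: percent_reduction(all_feats, selected)
    for model, (all_feats, selected) in training_time_s.items()
}
for model, value in reductions.items():
    print(f"{model}: {value:.1f}% shorter training time")
```

The two values come out at roughly 30% for RF and 58% for Cubist, matching the stated range.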
Table 3 compares the top features identified by different methods. Since
there is almost no dominant feature due to the complexity of the data, the
features identified by each method differ from one another. For the Mapper
algorithm, the features that overlap with the features identified by at least
one of the other methods are highlighted. In fact, even though the other
four methods are able to detect significant features, it is difficult to
interpret from them how the yield is affected by these features. In contrast, the
Mapper algorithm is capable of unraveling the relationships between the
features and product yield through easy and rapid visualization, as shown in
Fig. 4.
4.2. Fault detection of semiconductor manufacturing process
In this study, the data set² is collected from an Al stack etch process
performed on a commercial-scale Lam 9600 plasma etch tool at Texas
Instruments Inc. [24]. The data consist of 108 normal wafers and 21 faulty
wafers from three separate experiments (denoted as experiment numbers 29,
31, and 33) with 19 process variables for monitoring. Since two of the process
variables, RF reflected power and TCP reflected power, remain almost zero
during the entire batch, only 17 process variables are used for fault detection
and diagnosis, as tabulated in Table 4. Moreover, one normal wafer and one
faulty wafer are removed from the data set due to a large number of missing
values. Finally, because the experiments were run several weeks apart from
one another, process shift and drift lead to different means and covariance
structures in the data gathered in each of the three experiments.
The faulty wafers were intentionally induced through the modification of
² Available at
Table 3: Top 17 important features identified by different methods
PLS RF Cubist kGP TDA (Mapper)
M32 M32 M32 M32 B02
M36 B06 M17 B06 B03
M17 M17 M31 M13 B06
M13 M31 B06 M17 B08
M09 B03 M13 M36 M02
M33 M13 M04 B03 M04
M06 M01 M21 M31 M13
B06 B08 B03 M33 M19
M12 B11 M09 M09 M28
B03 M39 M01 B04 M29
B04 B04 M20 M06 B12
B08 M20 M39 M29 M09
B01 B09 B04 M04 M31
B11 M06 M33 B11 M26
M31 M18 M02 M02 M32
M04 M11 M05 B01 B04
M28 M33 B10 M27 M15
^a B02, B12, M30 and M40 are excluded from the PLS, RF, Cubist or kGP model since these
features are removed before the models are trained due to their high correlation with other
features.
^b The important features given by PLS, RF, Cubist and the TDA method are
ranked based on the weighted sums of the absolute regression coefficients, average impurity
reduction, usage in the rule conditions, and KS-score in Table 1, respectively. Features
with the same KS-score are ordered by their feature names. For the kGP method, a LOESS
smoother is fitted to assess the relationship between each feature and the outcome, and the
importance of the features is ranked by their R² statistics.
^c The ranking of feature importance varies somewhat with the training samples; the
results in Table 3 are reported for one particular choice of the samples.
Table 4: Process variables for semiconductor wafer fault detection
No.  Variables                No.  Variables
1    BCl3 flow                10   RF power
2    Cl2 flow                 11   RF impedance
3    RF bottom power          12   TCP tuner
4    Endpoint A detector      13   TCP phase error
5    Helium chuck pressure    14   TCP impedance
6    Pressure                 15   TCP top power
7    RF tuner                 16   TCP load
8    RF load                  17   Vat valve
9    RF phase error
several of the process variables: TCP power, RF power, pressure, BCl3 or Cl2
flow rate, and Helium chuck pressure. To simulate an actual sensor failure,
readings from the corresponding sensor were intentionally adjusted using a
bias term so that their mean value was equal to the original baseline value of
the relevant process variable. For example, if the TCP power was modified
from its normal baseline value of 350 W to a value of 400 W, the values of
TCP power in the data set would be reset to a mean of 350 W by adding a
constant bias of -50 W. Table 5 lists the induced faults associated with each
faulty wafer in the three experiments. In general, the modification of any
one of the process variables may be expected to result in changes to
the remainder of them because of correlations which may exist between the
process variables. In this work, our goal is to determine the process variables
which are most affected by the induced faults and use this information to
construct a classification model for fault detection.
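The bias adjustment in the TCP power example can be illustrated with a small sketch. The
function name and the four readings are hypothetical; only the 350 W baseline and the
400 W modified set point come from the text.

```python
from statistics import mean


def mask_setpoint_change(readings, baseline):
    """Shift readings by a constant bias so that their mean equals the baseline."""
    bias = baseline - mean(readings)
    return [r + bias for r in readings], bias


# Hypothetical TCP power readings (W) recorded after the set point was raised to 400 W.
readings = [399.0, 401.0, 400.5, 399.5]
masked, bias = mask_setpoint_change(readings, baseline=350.0)
print(bias)    # constant bias added to every reading
print(masked)  # readings now centered on the 350 W baseline
```

The recorded variable itself then looks normal on average, so the fault has to be detected
through its effect on the other, correlated process variables.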
4.2.1. Data preprocessing
We follow a data preprocessing procedure similar to that of the aforementioned study.
First, we remove the first five records to eliminate effects due to initial
fluctuations. To accommodate shorter batches, we retain 85 records in each
batch to ensure that each batch record is of equal length. Next, the 3-D data
array is unfolded batch-wise to a 2-D matrix, resulting in a total of 1445
features (17 process variables × 85 records), i.e., each feature is a pairwise
combination of a process variable and a batch record. Finally, each column of
the 2-D matrix is scaled to zero mean and unit variance.
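The preprocessing steps above can be sketched on synthetic data as follows. The function
names, the record-major column order, and the use of the population standard deviation are
assumptions made for illustration; the text does not specify these details.

```python
import random
from statistics import mean, pstdev


def unfold_batchwise(batches, skip=5, keep=85):
    """batches[b][t][v] -> one row per batch, one column per (record, variable) pair."""
    rows = []
    for batch in batches:
        kept = batch[skip:skip + keep]  # drop the initial records, truncate the rest
        if len(kept) < keep:
            raise ValueError("batch is too short for the requested number of records")
        # Assumed column order: all variables of record 1, then record 2, and so on.
        rows.append([value for record in kept for value in record])
    return rows


def standardize_columns(rows):
    """Scale every column to zero mean and unit variance (population std assumed)."""
    scaled_cols = []
    for col in zip(*rows):
        mu, sd = mean(col), pstdev(col)
        scaled_cols.append([(x - mu) / sd if sd else 0.0 for x in col])
    return [list(row) for row in zip(*scaled_cols)]


# Synthetic stand-in for the wafer data: 3 batches, 95 records, 17 process variables.
rng = random.Random(0)
batches = [[[rng.gauss(0, 1) for _ in range(17)] for _ in range(95)] for _ in range(3)]
X = standardize_columns(unfold_batchwise(batches))
print(len(X), len(X[0]))  # number of batches, number of features
```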
Table 5: Induced faults and experiments associated with each faulty wafer
No.  Exp.  Fault name               No.  Exp.  Fault name
1    29    TCP power +50^a          11   31    Cl2 flow +5
2    29    RF power -12             12   31    BCl3 flow -5
3    29    RF power +10             13   31    Pressure +2
4    29    Pressure +3              14   31    TCP power -20
5    29    TCP power +10            15   33    TCP power -15
6    29    BCl3 flow +5             16   33    Cl2 flow -10
7    29    Pressure -2              17   33    RF power -12
8    29    Cl2 flow -5              18   33    BCl3 flow +10
9    29    Helium chuck pressure    19   33    Pressure +1
10   31    TCP power +30            20   33    TCP power +20
^a The signed term in each fault name represents an offset of the process variable from
its normal baseline value during the batch. For example, "TCP power +50" means that
the induced fault is an increase of 50 units in the TCP power.
4.2.2. Feature selection
The etching process reflected in our data set is a typical nonlinear,
multimodal process. For this reason, the filter function used to identify a 2-D
embedding of the data is taken to be t-SNE, a nonlinear
dimensionality reduction method which, as previously mentioned, tends to
map dissimilar measurements far apart in the low-dimensional space. The
distance metric used between a given pair of 1445-dimensional data
measurements is, therefore, taken to be the joint probability between the two,
as defined in Eq. (3). The resolution is 24 intervals per dimension with 80%
overlap between adjacent intervals, and the DBSCAN method is once again
used for subset clustering. Fig. 5 shows that the generated topological
network of the semiconductor data is separated into three subnetworks. This is
consistent with the fact that the data sets collected from the three
experiments have different means and somewhat different covariance structures. It
is worth noting that faulty wafers 7 and 13 are two exceptions in the sense
that each one is grouped with other wafers which originated from a different
experiment.
Figure 5: Topological network derived from the semiconductor data at a specified
resolution. Each node is colored based on the average output for the measurements included in
the node, where the output of a faulty wafer is 1 while the output of a normal wafer is 0.
Subgroups that consist of nodes containing measurements of faulty wafers are extracted
from each subnetwork.
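To make the resolution setting used for this network concrete (24 intervals per dimension
with 80% overlap), the sketch below builds such a cover along one filter dimension. The
overlap convention used here (each interval shares 80% of its length with its neighbor) is
an assumption, since Mapper implementations differ on this point, and the function names
and the unit filter range are illustrative.

```python
def overlapping_cover(lo, hi, n_intervals, overlap):
    """Equal-length intervals covering [lo, hi]; neighbors share `overlap` of their length."""
    length = (hi - lo) / (1 + (n_intervals - 1) * (1 - overlap))
    step = length * (1 - overlap)
    return [(lo + k * step, lo + k * step + length) for k in range(n_intervals)]


def cover_membership(value, intervals):
    """Indices of all intervals that contain the given filter value."""
    return [k for k, (a, b) in enumerate(intervals) if a <= value <= b]


# One filter dimension, rescaled to [0, 1] for illustration.
intervals = overlapping_cover(0.0, 1.0, n_intervals=24, overlap=0.8)
hits = cover_membership(0.51, intervals)
print(len(intervals), hits)
```

With a 2-D filter, the cover is the product of the two 1-D covers (24 × 24 bins), and the
measurements falling in each bin are clustered separately, here with DBSCAN. The heavy
overlap means that a measurement typically belongs to several bins, which is what creates
the edges between nodes in the network.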
We color each node based on the average output value across all
the measurements in the node. The output is either 0 or 1, representing a
normal or a faulty wafer, respectively. As expected, measurements
representing faulty wafers are positioned at the boundary regions of each subnetwork.
We conjecture that this is because each faulty wafer was induced differently,
giving rise to different behaviors in the wafer processing. We further
identify subgroups consisting of nodes containing measurements of faulty wafers
in Fig. 5, as indicated by closed elliptical paths. Since the subgroups for
faulty wafers 10, 12, 14, and 20 have extremely small sample sizes, they are
excluded from the statistical tests for feature selection. For the rest of the
subgroups, Wilcoxon rank-sum tests are performed across all of the
process variables throughout the batch. As a non-parametric alternative to the
two-sample Student's t-test, the Wilcoxon rank-sum test is able to handle
small sample sizes and non-normal distributions. These tests are conducted
between each subgroup of faulty wafers and the nodes corresponding to
normal wafers in the rest of its subnetwork, excluding those which belong to
other subgroups of faulty wafers. The results of these tests are shown in
Fig. 6, where they are organized by process variable in subfigure (a) and by
batch record in subfigure (b).
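For one feature, the Wilcoxon rank-sum comparison between a faulty subgroup and the normal
wafers can be sketched as below. This version uses the large-sample normal approximation
without a tie correction, which is a simplification; for very small subgroups an exact test
would be more appropriate. The function names and sample values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation); returns (z, p)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2              # its mean under the null hypothesis
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
    return z, p


# Hypothetical values of one feature for a faulty subgroup and for normal wafers.
faulty = [5.1, 5.4, 5.9, 6.2]
normal = [1.0, 1.3, 0.8, 1.1, 0.9, 1.2, 1.4, 0.7]
z, p = rank_sum_test(faulty, normal)
print(z, p, p < 0.05)
```

Repeating this for each of the 1445 features and each faulty subgroup gives the binary
significance indicators of the kind plotted in Fig. 6.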
[Figure 6 panels: (a) features ordered by process variables (x-axis: process variable,
01 to 17); (b) features ordered by batch records (x-axis: batch records, 05 to 85). In both
panels the y-axis is the indicator of statistical significance, with one row per subgroup of
faulty wafers: No. 1, 5, 9; No. 2; No. 3, 6, 8, 4, 13; No. 7, 11; No. 15, 16; No. 17;
No. 18; No. 19.]
Figure 6: Wilcoxon rank-sum test to identify the features that best differentiate between
faulty wafers and normal wafers. The features are ordered by (a) process variables and
(b) batch records, respectively. Statistically significant features (p < 0.05) have values of
1 as represented by the blue lines.
By comparing the two orderings of the features, we find that statistically
significant features (p < 0.05) are more concentrated within individual
process variables than within individual batch records. For example, it is evident
that process variable 17 (Vat valve) is strongly correlated with faulty wafers,
while process variable 5 (Helium chuck pressure) has little impact on wafer
failure. As in Section 4.1, we perform the B-H procedure to adjust the p-values
and count the occurrences of statistically significant features throughout
the batch for every process variable. The results for both raw and adjusted
p-values are shown in Fig. 7. It is seen that the relative importance of the
process variables remains more or less the same after B-H adjustment,
especially for the first eight process variables in the ordering of Fig. 7. Hence,
we only select these eight process variables for fault classification.
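The counting step can be sketched as follows for a single set of p-values. The mapping from
feature column to process variable assumes a record-major column order in the unfolded
matrix, and the function names and toy p-values are hypothetical; the actual analysis
aggregates the tests for several faulty subgroups.

```python
N_VARIABLES = 17  # process variables in Table 4


def count_significant(p_values, alpha=0.05, n_variables=N_VARIABLES):
    """Count significant features per process variable (numbered from 1)."""
    counts = {v: 0 for v in range(1, n_variables + 1)}
    for j, p in enumerate(p_values):
        if p < alpha:
            # Assumed layout: column j holds variable (j mod 17) + 1 of some batch record.
            counts[j % n_variables + 1] += 1
    return counts


def top_variables(counts, k=8):
    """The k variables with the most significant features (ties broken by number)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [v for v, _ in ranked[:k]]


# Toy example with two batch records (2 x 17 = 34 features).
p_values = [0.5] * 34
p_values[16] = 0.01   # variable 17, record 1
p_values[33] = 0.02   # variable 17, record 2
p_values[3] = 0.03    # variable 4, record 1
counts = count_significant(p_values)
print(top_variables(counts, k=3))
```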
4.2.3. Predictive modeling
[Figure 7 plot: x-axis, process variables in the order 17, 04, 16, 12, 08, 10, 06, 07, 11,
14, 09, 13, 02, 01, 03, 15, 05; legend: without adjustment of p-values, with adjustment of
p-values.]
Figure 7: Counts of statistically significant features in terms of differentiating between
faulty and normal wafers for each process variable.
To build a fault detection classifier, we first compute the column means
throughout the batch for each variable and use them as the new feature
values. The transformed data are then randomly split into a training set and
a testing set in the ratio of 7:3, where each set maintains the same
proportion of normal and faulty wafers. The standard soft margin C-support
vector machine (SVM) classifier with a Gaussian kernel, as implemented in
LIBSVM [25], is employed for fault classification. The cost factor C and the
width of the Gaussian kernel (the variance σ, specified in LIBSVM through the
parameter γ) are tuned using 10-fold cross-validation on
the training set with an iterative grid search. We start with a coarse grid search
over exponentially growing sequences of C and γ, and then proceed
with finer grid searches in the vicinity of the optimal region yielded by the
previous grid search. Each grid search includes a total of 50 pairs of (C, γ)
values, each of which is used to train the model. To illustrate the
performance of the fault classifiers, receiver operating characteristic (ROC) curves
for the testing set with all process variables and with selected variables are
reported in Fig. 8. As seen in Fig. 8, the fault classifier with the eight selected
process variables outperforms the classifier which uses all process variables,
indicating the effectiveness of the former variables in predicting wafer failure.
Meanwhile, the computational time of each run is reduced by about 18%, from
1.1 s to 0.9 s.
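The coarse-to-fine grid search described above might look like the sketch below. The
starting ranges, the 10 × 5 split of the 50 pairs, and the shrink factor are assumptions,
and `toy_score` is a hypothetical stand-in for the 10-fold cross-validated accuracy of the
SVM, which is not implemented here.

```python
import math


def exponent_grid(center, half_width, n_points):
    """n_points exponents evenly spaced over [center - half_width, center + half_width]."""
    step = 2 * half_width / (n_points - 1)
    return [center - half_width + i * step for i in range(n_points)]


def coarse_to_fine_search(score, n_rounds=3, base=2.0):
    """Repeated grid search over (C, gamma), narrowing around the best pair each round."""
    c_center, g_center = 0.0, -5.0   # assumed starting exponents
    c_half, g_half = 9.0, 8.0        # assumed half-widths of the coarse grid
    best = None                      # (score, C exponent, gamma exponent)
    for _ in range(n_rounds):
        for ce in exponent_grid(c_center, c_half, 10):
            for ge in exponent_grid(g_center, g_half, 5):   # 10 x 5 = 50 pairs
                s = score(base ** ce, base ** ge)
                if best is None or s > best[0]:
                    best = (s, ce, ge)
        _, c_center, g_center = best  # recentre on the best pair found so far
        c_half /= 4
        g_half /= 4
    return base ** best[1], base ** best[2], best[0]


def toy_score(c, gamma):
    """Hypothetical smooth score peaking at C = 2**3, gamma = 2**-7."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 7) ** 2)


best_c, best_gamma, best_score = coarse_to_fine_search(toy_score)
print(best_c, best_gamma, best_score)
```

Searching on exponents rather than raw values keeps each grid evenly spread across orders
of magnitude, which is why the sequences of C and γ grow exponentially.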
[Figure 8 plot: x-axis, false positive rate (0 to 1); y-axis, true positive rate; legend:
all process variables, selected process variables.]
Figure 8: ROC curves of Gaussian kernel SVM classifiers on the data with all process
variables and with selected process variables.
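For reference, an ROC curve such as those in Fig. 8 is obtained by sweeping a threshold over
the classifier's decision scores. The sketch below computes the curve points and the area
under the curve; the labels and scores are hypothetical and unrelated to the reported
results.

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs for every score threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Process all samples sharing this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical test-set labels (1 = faulty, 0 = normal) and decision scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points, auc)
```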
5. Conclusion
In this paper, we apply a powerful TDA tool, the Mapper algorithm, for
predictive analysis of a chemical manufacturing process data set for yield
prediction and a semiconductor etch process data set for fault detection. We
show that the Mapper algorithm adds a new perspective to the traditional
means of feature selection and provides critical insights hidden in the complex
data. Through direct visualization, we generate an abstract view of the data
to facilitate a better understanding of the causal relationships between the
features and manufacturing system outputs. The contributions of the work
are summarized below:
• To the best of our knowledge, we demonstrate the value of a TDA method
  in the manufacturing systems domain for the first time.
• We effectively detect structural information present in manufacturing
  systems data, which is highly valuable as it allows identification of
  subgroups of interest for targeted hypothesis testing with respect to
  the differences in the observed patterns.
• We show that using just the identified features with the most significant
  causal relationships provides a similarly high level of prediction
  accuracy as achieved with the complete set of features, but with
  substantially reduced training times.
Thus, our results open a feasible path for efficient manufacturing process
monitoring and control, especially in complex systems with a large number
of process variables. In the future, we plan to embed the Mapper algorithm
in a sparse sensing framework to further reduce the need for measurements
in an optimal manner. We further aim to combine the Mapper algorithm
with existing machine learning techniques to increase the robustness of our
approach and yield a practical method which is better suited to
high-dimensional, heterogeneous manufacturing data in general.
We would like to thank the anonymous reviewers for their helpful comments.
[1] J. Tlusty, Manufacturing processes and equipment, Prentice Hall, 2000.
[2] J. F. MacGregor, T. Kourti, Statistical process control of multivariate processes, Control Engineering Practice 3 (1995) 403–414.
[3] B. Schölkopf, A. Smola, K.-R. Müller, Nonlinear component analysis as a kernel eigenvalue problem, Neural Computation 10 (1998) 1299–1319.
[4] R. Rosipal, L. J. Trejo, Kernel partial least squares regression in reproducing kernel Hilbert space, Journal of Machine Learning Research 2 (2001) 97–123.
[5] G. Carlsson, Topology and data, Bulletin of the American Mathematical Society 46 (2009) 255–308.
[6] G. Singh, F. Mémoli, G. E. Carlsson, Topological methods for the analysis of high dimensional data sets and 3D object recognition, in: Eurographics Symposium on Point-Based Graphics, pp. 91–100.
[7] Y. Yao, J. Sun, X. Huang, G. R. Bowman, G. Singh, M. Lesnick, L. J. Guibas, V. S. Pande, G. Carlsson, Topological methods for exploring low-density states in biomolecular folding pathways, The Journal of Chemical Physics 130 (2009) 144115.
[8] G. Sarikonda, et al., CD8 T-cell reactivity to islet antigens is unique to type 1 while CD4 T-cell reactivity exists in both type 1 and type 2 diabetes, Journal of Autoimmunity 50 (2014) 77–82.
[9] M. Nicolau, A. J. Levine, G. Carlsson, Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival, Proceedings of the National Academy of Sciences 108 (2011) 7265–7270.
[10] W. Guo, A. G. Banerjee, Toward automated prediction of manufacturing productivity based on feature selection using topological data analysis, in: Proceedings of IEEE International Symposium on Assembly and Manufacturing, pp. 31–36.
[11] Q. P. He, J. Wang, Fault detection using the k-nearest neighbor rule for semiconductor manufacturing processes, IEEE Transactions on Semiconductor Manufacturing 20 (2007) 345–354.
[12] Q. P. He, J. Wang, Large-scale semiconductor process fault detection using a fast pattern recognition-based method, IEEE Transactions on Semiconductor Manufacturing 23 (2010) 194–200.
[13] Y. Li, X. Zhang, Diffusion maps based k-nearest-neighbor rule technique for semiconductor manufacturing process fault detection, Chemometrics and Intelligent Laboratory Systems 136 (2014) 47–57.
[14] Z. Zhou, C. Wen, C. Yang, Fault detection using random projections and k-nearest neighbor rule for semiconductor manufacturing processes, IEEE Transactions on Semiconductor Manufacturing 28 (2015) 70–79.
[15] F. Famili, W.-M. Shen, R. Weber, E. Simoudis, Data pre-processing and intelligent data analysis, International Journal on Intelligent Data Analysis 1 (1997).
[16] C.-T. Su, T. Yang, C.-M. Ke, A neural-network approach for semiconductor wafer post-sawing inspection, IEEE Transactions on Semiconductor Manufacturing 15 (2002) 260–266.
[17] P. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, G. Carlsson, Extracting insights from the shape of complex data using topology, Scientific Reports 3 (2013).
[18] J. W. Milnor, Morse theory, 51, Princeton University Press, 1963.
[19] L. V. D. Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (2008) 2579–2605.
[20] M. Kuhn, K. Johnson, Applied predictive modeling, Springer, 2013.
[21] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining, volume 96, pp. 226–231.
[22] Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B 57 (1995) 289–300.
[23] D. Yekutieli, Y. Benjamini, Discovering the false discovery rate, Journal of Statistical Planning and Inference 82 (1999) 171–196.
[24] B. M. Wise, et al., A comparison of principal component analysis, multiway principal component analysis, trilinear decomposition and parallel factor analysis for fault detection in a semiconductor etch process, Journal of Chemometrics 13 (1999) 379–396.
[25] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Transactions on Intelligent Systems and Technology (TIST) 2 (2011) 27.
... Plots of the features were conducted and, partitioning of the groups for the k number of clusters is also obtained. (Guo and Banerjee 2017) highlighted the problem emanating from redundancy in sensor measurements and cited the use of multivariate statistical process control (MSPC) such as principal component analysis (PCA), Partial least-squares (PLS), and commented that "the two methods have served as the dominant ways of addressing the problem." (Krim, Gentimis, and Chintakunta 2016) introduces the common tools in applied algebraic topology, in an easily applicable context and offer a framework that naturally adapted to signal processing problems with some tools from linear algebra, examples and illustrations. ...
... [16] stated that the problem could be formulated "mathematically" as follows: "Supposing there are n data variables (or features) and p sensor measurements at different recorded time instants, each measure representing an m-dimensional vector ∈ ℝ , = 1, 2, … , . "; "The data are then assembled into a matrix = [, 2 , … , ] ∈ ℝ × ; each column denotes a process variable measured by one sensor operating alone" (Guo and Banerjee 2017). The significance of the method we proposed is that it automatically fixes threshold values for variables. ...
Full-text available
TDA (i.e., Topological Data Analysis) has recently been a reliable and current research area in Statistics for extracting shape (information) from data. In this study, the researchers proposed an automated method that uses TDA & ML in identifying floods (ARs) in big data. Our process gives vital details on time series trends, which help mitigate the negative effect of ARs, such as flooding. The spatial data (between 1970 - 2018) from Nigeria Hydrological Services Agency (NIHSA) on four weather parameters were used. The daily datasets were converted to monthly datasets before the proposed method was applied. Python Software is used to develop code in the implementation of our process. Mostly, the outcome facts studied will drastically reduce disasters due to extreme events like floods and achieve some SDG goals related to the flood. The second objective is to identify potential flooding and no flooding in each zone. The work successfully used a real dataset and four variables that other studies have not used to fill a gap. After our model's training process, we obtained the best group at k = 2, where we have the highest Silhouette coefficient in each of the seven states. We have found a reasonable structure in the study considering the total average range (0.3 - 0.8). That gives an efficiency outcome of approximately 80%. Summary of clustered feature pattern shows the potential flood zone and no flood zone. We conducted cluster validity of our results using R software codes and, the test validated the best group at the same cluster k = 2. The Gap statistic shows efficiency ranging between 65% to 80% in the seven states. We found from figure 11 that only the Silhouette plot obtained optimal values at exactly k = 2; The researchers got the extent of the spread from the centroid using Excel software.
... Among these, being closer to the subject of this paper, [17] introduces a density-based algorithm for discovering clusters in large spatial databases with noise, while [18] refers to the dynamic data assigning assessment clustering. The authors of [19] propose that the identification of key features for accurate prediction of MS outputs be accomplished by using the topological data analysis. More specifically, the Mapper algorithm [20] has been applied to two benchmark datasets for the chemical process yield prediction and semiconductor wafer fault detection, in order to capture intrinsic clusters and connections among the clusters present in the datasets, which are difficult to detect using traditional methods. ...
...  The proposed MS causal modelling is applicable and works well. Unlike the existing approaches (see [8], [14], [19]), it delivers a significant number of model structure forms and facilitates the selection of the most suitable model.  The proposed method for CM allows correlating the model accuracy to the actually required level by choosing an appropriate feature cluster from the delivered clusters. ...
Full-text available
In manufacturing system management, the decisions are currently made on the base of ‘what if’ analysis. Here, the suitability of the model structure based on which a model of the activity will be built is crucial and it refers to multiple conditionality imposed in practice. Starting from this, finding the most suitable model structure is critical and represents a notable challenge. The paper deals with the building of suitable structures for a manufacturing system model by data-driven causal modelling. For this purpose, the manufacturing system is described by nominal jobs that it could involve and is identified by an original algorithm for processing the dataset of previous instances. The proposed causal modelling is applied in two case studies, whereby the first case study uses a dataset of artificial instances and the second case study uses a dataset of industrial instances. The causal modelling results prove its good potential for implementation in the industrial environment, with a very wide range of possible applications, while the obtained performance has been found to be good.
... The simplest widely accepted form of TDA is clustering. Development of the efficient computational algorithms in combination with ever-growing computational power catalyzed the adoption of more engaging and informative forms of TDA, including persistent homology (PH) [1,2], and Mapper TDA [3,4,5,6,7,8,9,10]. Both PH and Mapper reveal lacunae in the data: persistent homology does so in the exhaustive manner and focuses on capturing the scales where lacunae exists; Mapper constructs sckeletonized representation of the data set that corresponds to a particular scale and reveals lacunae that might exist on that scale. ...
We introduce an approach to the targeted completion of lacunae in molecular data sets which is driven by topological data analysis, such as Mapper algorithm. Lacunae are filled in using scaffold-constrained generative models trained with different scoring functions. The approach enables addition of links and vertices to the skeletonized representations of the data, such as Mapper graph, and falls in the broad category of network completion methods. We illustrate application of the topology-driven data completion strategy by creating a lacuna in the data set of onium cations extracted from USPTO patents, and repairing it.
... The method was introduced by [17] and its theory has been further considered in, for example, [8], [16], [26], [46] and [64]. Furthermore, persistent homology has been applied in a variety of fields including interconnectedness in the banking system [14], manufacturing systems [27] and computational biology with industrial and medical engineering applications [25,59]. Other notable applications have been in data to the Betti number for each fixed dimension homology . ...
Full-text available
Persistent homology is a common technique in topological data analysis providing geometrical and topological information about the sample space. All this information, known as topological features, is summarized in persistence diagrams, and the main interest is in identifying the most persisting ones since they correspond to the Betti number values. Given the randomness inherent in the sampling process, and the complex structure of the space where persistence diagrams take values, estimation of Betti numbers is not straightforward. The approach followed in this work makes use of features' lifetimes and provides a full Bayesian clustering model, based on random partitions, in order to estimate Betti numbers. A simulation study is also presented.
... These applications include monitoring/diagnosis in the multistage cap alignment process [14], fixture failure diagnosis [15], a systematic approach for process fault diagnosis based on BN considering incomplete data and varying noise levels. Additional application examples of ML in manufacturing systems include topological data analysis [16], deep learning [17], and genetic algorithms to evaluate form tolerances [18]. ...
Conference Paper
Reducing the dimensional variability of the body-in-white (BIW) in automotive manufacturing is perhaps the most difficult quality control problem due to complex interdependencies amongst the multiple assembly stations that a BIW must pass through in a bodyshop. As increasing quantities of dimensional data are generated in factories, manufacturers face the challenge and opportunity to derive value from the data by enabling advanced quality control methods that can realize greater dimensional stability. As the BIW moves through the bodyshop, dimensional deviations propagate and amplify to downstream stations affecting final vehicle fit-and-finish and visible quality aesthetics potentially influencing a customers’ purchase decision. Current BIW quality approaches rely on univariate statistical process control (SPC) charts. With the large amounts of complex data produced, such charts often fail to detect quality patterns that may exist in hyper-dimensional spaces. As a stop-gap measure, manufacturers attempt to remediate quality issues by assigning operators in final vehicle assembly to visually identify and manually fix apparent deviations. This paper illustrates the application of artificial intelligence (AI) to develop a real-time monitoring system that seeks to predict and detect early dimensional quality issues and eliminate the need for costly downstream corrective actions. Moreover, beyond early detection and prediction, the proposed system also facilitates diagnosis of root causes and understanding the true nature of quality issues.
Full-text available
The last decades have witnessed the rapid growth of advanced technologies and their application which has a significant influence on industrial manufacturing, leading to smart manufacturing (SM). The recent development of information and communication technologies has engendered the concept of the smart factory that adds intelligence into the manufacturing process to drive continuous improvement, knowledge transfer, and data-based decision-making. The Industrial Internet of Things (IIoT) is one of the main technologies used to enable smart factories, which is specific to industrial applications. IIoT is about connecting all the industrial assets, including machines and control systems, with the information systems and the business processes. A huge volume of data collected can feed real-time analytic solutions provided by Artificial Intelligence (AI), Big Data Analytics, and Decision Support Systems (DSS), which can lead to optimal industrial operations. In SM, based on modern technologies of the IIoT, the process of collecting, transforming, and storing data from all stages of the production process becomes easier and more efficient, promoting the era of Big Data in manufacturing. AI algorithms provide powerful tools for exploiting the wealth of data generated in the IIoT. By extracting useful information and features from Big Data, the AI algorithms allow complex tasks such as predicting, maintaining, monitoring, and optimizing the production process to be performed smartly and efficiently. To combine human knowledge with these above results, DSS is integrated to help manufacturers manage data, analytics, modeling, and planning effectively. This chapter aims to provide a survey of the key techniques that enable SM, including IIoT, Big Data, DSS, and AI. Several important perspectives for the decentralized techniques in SM will be discussed. There are two illustrative examples demonstrated, the source code can be found
We are now witnessing the rapid development and powerful application of advanced technologies, leading to the fourth industrial revolution, or Industry 4.0. The wide use of cyber-physical systems (CPS) and the internet of things (IoT) lead to the era of Big Data in industrial manufacturing. Artificial intelligence (AI) algorithms emerge as powerful analytic tools to process and analyze Big Data. These advanced technologies resulted in the introduction of a new concept in Industry 4.0: smart manufacturing. In order to fully understand this new concept in the context of Industry 4.0, this chapter presents a survey on the key components of smart manufacturing as well as the link between them, including the industrial internet of things (IIoT), Big Data, and AI. Several studies enabled smart manufacturing, and perspectives for further research are reviewed and discussed. Finally, we provide a case study where the problem of remaining useful lifetime (RUL) prognostic is considered using AI algorithms.
Topological data analysis (TDA) has recently become a reliable research area in statistics for extracting shape from data. Flooding annually destroys property, buildings, and farmland and causes loss of life in various regions of the world. This study presents a new K-means clustering method that combines TDA and machine learning (ML) functions. The method is aimed at addressing flood problems by identifying the feature patterns of floods associated with seven selected states in Nigeria; predicting flood menace; measuring the extent of spread of the resultant clusters (degree of flood risk); and choosing the best test that measures the validity of the analysis used. The method is designed to provide vital information about the shape characteristics of spatial data. Other advantages include its flexibility in combination with other methods and the fact that it is threshold-free (i.e., no threshold criterion needs to be fixed for the detection method; its own properties take care of this). After training the model, we obtained the best grouping at k = 2, which had the highest Silhouette coefficient and gave an efficiency of approximately 80%. The method was able to detect the flooding and non-flooding areas in the data and to reveal the variability of the clusters. The findings can inform dwellers in flood zones so that they can vacate and avert flood disaster, and can help risk managers take action. We recommend the method for flood mitigation and control. Further research is needed to explore the combination aspect.
Big data analytics is playing a more and more prominent role in the manufacturing industry as corporations attempt to utilize vast amounts of data to optimize the operation of plants and factories to gain a competitive advantage. Since the advent of Industry 4.0, also known as smart manufacturing, big data analytics, combined with expert domain knowledge, is facilitating ever-greater levels of speed and automaticity in manufacturing processes. The semiconductor industry is a fundamental driver of this transformation; moreover, due to the highly complex and energy-consuming nature of the semiconductor manufacturing process, semiconductor fabrication facilities (fabs) can also benefit greatly from incorporating big data analytics to improve production and energy efficiency. This paper developed a big data analytics framework, along with an empirical study conducted in collaboration with a semiconductor manufacturer in Taiwan, to optimize the energy efficiency of chiller systems in semiconductor fabs. Chiller systems are one of the most energy-consuming systems within a typical modern fab. The developed big data analytics framework allows production managers to ensure that chiller systems operate at an optimized level of energy efficiency under dynamically changing conditions, while fulfilling the chilling demands. Compared to the commonly-used heuristics previously employed at the fab to tune chiller system parameters, by the utilization of big data analytics, it is shown that fabs can achieve substantial energy savings, greater than 12%. The developed framework and the lessons learned from the empirical study are not only generalizable but also useful for practitioners who are interested in applying big data analytics to optimize the performance of other equipment systems in fabs.
In this paper, we extend the application of topological data analysis (TDA) to the field of manufacturing for the first time, to the best of our knowledge. We apply a particular TDA method, known as the Mapper algorithm, on a benchmark chemical processing data set. The algorithm yields a topological network that captures the intrinsic clusters and connections among the clusters present in the high-dimensional data set, which are difficult to detect using traditional methods. We select key process variables or features that impact the final product yield by analyzing the shape of this network. We then use three prediction models to evaluate the impact of the selected features. Results show that the models achieve the same level of high prediction accuracy as with all the process variables, thereby providing a way to carry out process monitoring and control in a more cost-effective manner.
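The general shape of a Mapper computation can be illustrated with a minimal sketch. It is not the authors' implementation: the filter function (the first coordinate), the overlapping interval cover, the single-linkage clustering with a fixed distance threshold `eps`, and the toy circle data are all assumptions chosen to keep the example small.

```python
import math
from itertools import combinations


def single_linkage(indices, points, eps):
    """Split `indices` into connected components of the graph that
    links any two points at distance <= eps (an assumed clusterer)."""
    unvisited = set(indices)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        component, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited
                    if math.dist(points[i], points[j]) <= eps}
            unvisited -= near
            component |= near
            frontier.extend(near)
        clusters.append(frozenset(component))
    return clusters


def mapper(points, filter_values, n_intervals=4, overlap=0.5, eps=0.3):
    """Return (nodes, edges): nodes are clusters found inside each
    overlapping filter interval; edges join clusters sharing a point."""
    lo, hi = min(filter_values), max(filter_values)
    # Interval length chosen so n_intervals overlapping pieces span [lo, hi].
    length = (hi - lo) / (n_intervals - overlap * (n_intervals - 1))
    step = length * (1 - overlap)
    nodes = []
    for k in range(n_intervals):
        a = lo + k * step
        # Pin the last endpoint to hi to avoid losing the maximum to rounding.
        b = hi if k == n_intervals - 1 else a + length
        members = [i for i, f in enumerate(filter_values) if a <= f <= b]
        nodes.extend(single_linkage(members, points, eps))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


# Toy data (hypothetical): 40 points evenly spaced on the unit circle.
points = [(math.cos(2 * math.pi * i / 40), math.sin(2 * math.pi * i / 40))
          for i in range(40)]
filter_values = [x for x, _ in points]  # filter = first coordinate
nodes, edges = mapper(points, filter_values)
print(len(nodes), "nodes;", len(edges), "edges")
```

On this toy circle the resulting network is a single closed loop, which is the kind of shape information that the network analysis relies on. With real process data, the choice of filter, cover resolution, and clustering threshold would need to be tuned.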
Fault detection is essential for improving the overall equipment efficiency of the semiconductor manufacturing industry. It has been recognized that fault detection based on the k-nearest-neighbor (kNN) rule can effectively deal with some characteristics of semiconductor processes, such as multimode batch trajectories and nonlinearity. However, the computational complexity and storage space involved in the neighbor search of kNN prevent it from being used for online monitoring, especially in high-dimensional cases. To deal with this difficulty, principal component-based kNN has also been presented in the literature, in which dimension reduction is done by principal component analysis (PCA) before the kNN rule is applied to fault detection. However, dimension reduction by PCA may distort the distances between pairs of samples (trajectories). Thus, the false alarms and missed detections of kNN for fault detection may increase in the principal component subspace because PCA fails to preserve pairwise distances in the subspace. To overcome this drawback, we propose a new fault detection method based on random projection and the kNN rule, which combines the advantages of random projection in distance preservation (in expectation) and the kNN rule in dealing with the problems of multimodality and nonlinearity that often coexist in semiconductor manufacturing processes. An industrial example illustrates the performance of the proposed method.
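The idea of combining random projection with a kNN rule can be sketched as follows. This is only an illustration under stated assumptions, not the method of the cited paper: the Gaussian projection matrix, the statistic (sum of squared distances to the k nearest in-control samples), the leave-one-out empirical quantile used as the control limit, and the synthetic data are all hypothetical choices.

```python
import random


def random_projection_matrix(n_features, n_components, rng):
    """Gaussian matrix scaled by 1/sqrt(n_components), so squared
    distances are preserved in expectation after projection."""
    scale = 1.0 / n_components ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_features)]


def project(x, matrix):
    """Map a sample x into the low-dimensional random subspace."""
    return [sum(x[i] * matrix[i][j] for i in range(len(x)))
            for j in range(len(matrix[0]))]


def knn_statistic(z, reference, k):
    """Sum of squared distances from z to its k nearest reference points."""
    d2 = sorted(sum((a - b) ** 2 for a, b in zip(z, r)) for r in reference)
    return sum(d2[:k])


def fit_detector(normal_data, n_components=5, k=3, quantile=0.99, seed=0):
    """Project in-control data and set an empirical control limit."""
    rng = random.Random(seed)
    matrix = random_projection_matrix(len(normal_data[0]), n_components, rng)
    reference = [project(x, matrix) for x in normal_data]
    # Leave-one-out statistics of the in-control samples set the limit.
    stats = sorted(
        knn_statistic(z, reference[:i] + reference[i + 1:], k)
        for i, z in enumerate(reference)
    )
    limit = stats[min(len(stats) - 1, int(quantile * len(stats)))]
    return matrix, reference, k, limit


def is_fault(x, detector):
    """Flag x when its kNN statistic exceeds the control limit."""
    matrix, reference, k, limit = detector
    return knn_statistic(project(x, matrix), reference, k) > limit


# Synthetic example (hypothetical data): 200 in-control samples in 20-D.
data_rng = random.Random(1)
normal = [[data_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(200)]
detector = fit_detector(normal)
fault = [50.0] * 20  # a grossly shifted sample
print("fault flagged:", is_fault(fault, detector))
```

Only the projected reference set needs to be stored and searched, which is where the storage and computation savings over kNN in the original space come from.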
In the semiconductor industry, traditional multivariate statistical process monitoring methods and pattern classification based detection methods have been developed to detect the semiconductor process faults. However, they do not show superior performance due to the limits of these methods and the unique characteristics of semiconductor processes such as non-linearity and multimodal batch trajectories. This paper presents a novel diffusion maps based k-nearest-neighbor rule (DM-kNN) technique that can reduce data-storage costs and enhance the performance of fault detection by integrating diffusion maps analysis with k-nearest-neighbor rule. DM-kNN takes full advantage of the dimensionality reduction and information preserving properties of DM to extract the low dimensional manifold feature that optimally preserves the intrinsic nonlinear structure of the data set. Then the adapted kNN rule based fault detection method is applied to the low dimensional manifold feature space to detect potential faults. The effectiveness and robustness of DM for dimensionality reduction and feature extraction are verified in simulation experiments compared with other linear and nonlinear dimensionality reduction methods. In addition, DM-kNN is applied to monitor the semiconductor manufacturing process. The fault detection results of the proposed method are demonstrated to be superior to those of the MPCA, FD-kNN, PC-kNN and FS-kNN approaches.
Background: Prepulse inhibition (PPI) is an indicator of sensorimotor gating reactivity in rodents and humans sensitive to psychotropic drugs. Studies suggest increased or decreased dopamine levels in the brain alter several modalities of sensorimotor gating. However, so far, little is known about the role norepinephrine (NE) plays on such mechanisms. An antidepressant agent, reboxetine (RBX), is a selective NE transporter blockade that can affect PPI in patients with major depression by elevating synaptic NE levels. Methods: The present study was designed to examine the effects of RBX versus the wake-promoting agent, modafinil (MOD), on the PPI of tail-pinch (TP) challenged rats. Results: The results showed a disparity in changes of sensorimotor gating reactivity between RBX and MOD administrations in rats during noxious TP stress. Conclusion: These data indicated the antidepressant mechanisms of RBX and MOD might differ. We do not support the view that MOD is promising as an augmenter of antidepressants.
When predicting a numeric outcome, some measure of accuracy is typically used to evaluate the model’s effectiveness. However, there are different ways to measure accuracy, each with its own nuance. In Section 5.1 we define common measures for evaluating quantitative performance. We also discuss the concept of variance-bias trade-off (Section 5.2), and the implication of this principle for predictive modeling. In Section 5.3, we demonstrate how measures of predictive performance can be generated in R.
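As a small illustration of such measures, the sketch below computes three widely used ones: root mean squared error (RMSE), mean absolute error (MAE), and the coefficient of determination (R²). The chapter works in R; Python is used here only for illustration, and the choice of these three measures and the toy numbers are assumptions rather than content taken from the chapter.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large residuals more heavily."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error: average magnitude of the residuals."""
    n = len(y_true)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (hypothetical) to show how the measures differ.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print("RMSE:", rmse(observed, predicted))
print("MAE:", mae(observed, predicted))
print("R^2:", r_squared(observed, predicted))
```

RMSE and MAE are in the units of the outcome, whereas R² is unitless, so the three are not interchangeable when comparing models across data sets.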
Previous cross-sectional analyses demonstrated that CD8+ and CD4+ T-cell reactivity to islet-specific antigens was more prevalent in T1D subjects than in healthy donors (HD). Here, we examined T1D-associated epitope-specific CD4+ T-cell cytokine production and autoreactive CD8+ T-cell frequency on a monthly basis for one year in 10 HD, 33 subjects with T1D, and 15 subjects with T2D. Autoreactive CD4+ T-cells from both T1D and T2D subjects produced more IFN-γ when stimulated than cells from HD. In contrast, higher frequencies of islet antigen-specific CD8+ T-cells were detected only in T1D. These observations support the hypothesis that general beta-cell stress drives autoreactive CD4+ T-cell activity while islet over-expression of MHC class I commonly seen in T1D mediates amplification of CD8+ T-cells and more rapid beta-cell loss. In conclusion, CD4+ T-cell autoreactivity appears to be present in both T1D and T2D while autoreactive CD8+ T-cells are unique to T1D. Thus, autoreactive CD8+ cells may serve as a more T1D-specific biomarker.