Applying Data Mining Techniques to Address Critical
Process Optimization Needs in Advanced Manufacturing
Li Zheng1, Chunqiu Zeng1, Lei Li1, Yexi Jiang1, Wei Xue1, Jingxuan Li1, Chao Shen1,
Wubai Zhou1, Hongtai Li1, Liang Tang1, Tao Li1, Bing Duan2, Ming Lei2and Pengnian Wang2
1School of Computer Science, Florida International University, Miami, FL, USA 33174
2ChangHong COC Display Devices Co., Ltd, Mianyang, Sichuan, China 621000
ABSTRACT
Advanced manufacturing, such as aerospace, semiconductor, and flat display device production, often involves complex production processes and generates large volumes of production data. In general, the production data comes from products with different levels of quality, assembly lines with complex flows and equipment, and processing crafts with massive numbers of controlling parameters. The scale and complexity of the data are beyond the analytic power of traditional IT infrastructures. To achieve better manufacturing performance, it is imperative to explore the underlying dependencies in the production data and exploit analytic insights to improve the production process. However, few research or industrial efforts have been reported on providing manufacturers with integrated data analytical solutions that reveal potential improvements and optimize the production process from a data-driven perspective.
In this paper, we design, implement and deploy an inte-
grated solution, named PDP-Miner, which is a data analyt-
ics platform customized for process optimization in Plasma
Display Panel (PDP) manufacturing. The system utilizes
the latest advances in data mining technologies and Big
Data infrastructures to create a complete analytical solu-
tion. In addition, our proposed system supports automatic configuration and scheduling of analysis tasks, and balances heterogeneous computing resources. The system
and the analytic strategies can be applied to other advanced
manufacturing fields to enable complex data analysis tasks.
Since 2013, PDP-Miner has been deployed as the data analy-
sis platform of ChangHong COC1. By taking advantage of our system, the overall PDP yield rate has increased from
91% to 94%. The monthly production is boosted by 10,000
panels, which brings more than 117 million RMB of revenue
improvement per year2.
1ChangHong COC Display Devices Co., Ltd is one of the
world’s largest display device manufacturing companies in
China (http://www.cocpdp.com).
2http://articles.e-works.net.cn/mes/article113579.htm.
Categories and Subject Descriptors: H.2.8 [Database Applications]: Data Mining; H.4 [Information Systems Applications]: Miscellaneous
Keywords: Advanced Manufacturing, Big Data, Data Min-
ing Platform, Process Optimization
1. INTRODUCTION
The manufacturing industry involves the production of
merchandise for use or sale using labor and machines, tools,
chemical processing, etc. It has been the mainstay of many
developed economies and remains an important driver of
GDP (Gross Domestic Product). According to Bureau of Economic Analysis data, every dollar of goods produced in manufacturing generates $1.48 in economic activity, the highest economic multiplier among major economic sectors3. With the advancement of new technologies, many manufacturers utilize cutting-edge materials and emerging capabilities enabled by the physical, biological, chemical and computer sciences. The improved manufacturing process is often referred to as "advanced manufacturing" [15, 28]. For example, organizations in the oil and gas industry apply new technologies to
transform raw data into actionable insight to improve asset
value and product yield while enhancing safety and protect-
ing the environment.
In advanced manufacturing, a medium-sized or large manufacturer often arranges complex and elaborate production processes according to the product structure, and generates a large volume of production data collected by sensor technologies [8], Manufacturing Execution Systems (MES) [14], and Enterprise Resource Planning (ERP) [22]. In practice, the production data contains intricate dependencies among a tremendous number of controlling parameters in the production workflow. Generally, it is extremely difficult or even impossible for analysts to manually explore such dependencies, let alone propose strategies to optimize the workflow.
Fortunately, the use of data analytics offers manufacturers great opportunities to acquire actionable insights for optimizing the production workflow. However, in
practice, there is a significant application gap between
manufacturers and data analysts in observing the data and
using automation tools. Table 1 highlights the perspective differences between manufacturers and analysts on three important aspects: (1) Capacity, i.e., what the data looks like; (2) Capability, i.e., how the data can be utilized; and (3) Knowledge, i.e., how to perform knowledge discovery and management.
3JEC Democratic staff calculations based on data from the Bureau of Economic Analysis, Industry Data, Input-Output Accounts, Industry-by-Industry Total Requirements after Redefinitions (1998 to 2011).

Table 1: Perspective Differences Between Manufacturers and Data Analysts.
Manufacturers:
•Capacity: huge production output; sophisticated workflow; complex supply chain
•Capability: control yield rate; optimize production line; effective parameter setting
•Knowledge: private know-how; high dependency on experts; high cost of testing
Data Analysts:
•Capacity: large number of samples; high-dimensional data; complex parameter dependencies
•Capability: process optimization; feature reduction and selection; feature association analysis
•Knowledge: utilize domain expertise; knowledge sharing; knowledge management
Application Gap:
•utilize customized data analysis algorithms to mine the underlying knowledge
•provide configurable task platforms to allow automatic taskflow execution
•enable efficient knowledge representation and management
To bridge the gap, it is imperative to provide automated
tools to the manufacturers to enhance their capability of an-
alyzing production data. Data analytics in advanced man-
ufacturing, especially data mining approaches, have been
targeting several important fields, such as product qual-
ity analysis [20, 26], failure analysis of production [3, 25],
production planning and scheduling analysis [1, 2], analytic
platform implementation [7, 8], etc. However, few research
and industrial efforts have been reported on providing man-
ufacturers with an integrated data analytical platform,
to enable automatic analysis of the production data and ef-
ficient optimization of the production process. Our goal is
to provide such a solution to practically fill the gap.
1.1 A Concrete Case: PDP Manufacturing
At ChangHong COC Display Devices Co., Ltd (COC for short), Plasma Display Panel (PDP) manufacturing produces over 10,000 panels per day. The production line is nearly 6,000 meters long, and the process contains 75 assembly routines and 279 major pieces of production equipment with more than 10,000 parameters. The average production time throughout the manufacturing process is 76 hours. Specifically, the workflow consists of three major procedures shown in Figure 1, i.e., front panel, rear panel, and panel assembly. Each procedure contains multiple sequentially executed flows, and each flow is composed of multiple key routines. The first two procedures are executed in parallel, and each pair of front and rear panels is assembled in the assembly procedure. Figure 2 depicts the real assembly line of one routine (Tin-doped Indium Oxide, ITO) in the front panel procedure, which gives a sense of how complex the complete production process is.
Figure 1: PDP Manufacturing Production Flow (front panel processing, rear panel processing, and panel assembly; 3 major procedures, 75 assembly routines, 279 major pieces of equipment, over 10,000 parameters, a 6,000 m production line, and 76 hours of processing time).
Figure 2: An Example Routine in PDP Workflow.
There are 83 types of equipment in the PDP manufacturing process, each of which has a different set of parameters to fulfill the corresponding processing tasks. The parameters are often preset to certain values to ensure the normal operation of each piece of equipment. However, the observed parameter values often deviate from the preset values. Furthermore, in the production environment, external factors, e.g., temperature, humidity, and atmospheric pressure, may affect the product quality, as the raw materials and equipment are sensitive to these factors. The observed values of external factors vary significantly with sensor location and acquisition time. The production process generates a huge amount of production data (10 gigabytes per day with 30 million records).
In daily operations, the manufacturers are concerned with
how to improve the yield rate of the production. To achieve
this goal, several questions need to be carefully addressed,
including
•What are the key parameters whose values can signif-
icantly differentiate qualified products from defective
products?
•How do parameter value changes affect the production rate?
•What are the effective parameter recipes to ensure high
yield rate?
Answering these questions, however, is a non-trivial task due to the scale and complexity of the production data; it is impossible for domain analysts to manually explore the data. Hence, it is necessary to automate the optimization process using appropriate infrastructural and algorithmic solutions.
1.2 Challenges and Proposed Solutions
The massive production data poses great challenges to
manufacturers in effectively optimizing the production work-
flow. During the past two years, we have been working
closely with the technicians and engineers from COC to in-
vestigate data-driven techniques for improving the yield rate
of production. During this process, we have identified two
key challenges and proposed the corresponding solutions to
each challenge as follows.
In general, a highly automated production process generates a large volume of data containing a myriad of controlling parameters with their corresponding observed values. The parameters may have malformed or missing values due to inaccurate sensing or transmission. Therefore, it is crucial to efficiently store and preprocess these data in order to handle both their increasing scale and their incompleteness. In addition, analyzing the production data is a cognitive activity on the production workflow, embodying an iterative process of exploring the data, analyzing it, and representing the insights. A practical system should provide an integrated and highly efficient solution to support this process.
Challenge 1. Facing the enormous data with sustained
growth, how to efficiently support large-scale data analysis
tasks and provide prompt guidance to different routines in
the workflow?
Existing data mining products, such as Weka [9], SPSS and SQL Server Data Tools, provide functionalities that facilitate analysis. However, these products are designed for small- or medium-scale data analysis, and hence cannot be applied to our problem setting. To ad-
dress Challenge 1, we design and implement an integrated
Data Analytics Platform based on a distributed system [32]
to support high-performance analysis. The platform man-
ages all the production data in a distributed environment,
which is capable of configuring and executing data prepro-
cessing and data analysis tasks in an automatic way. The
platform has the following functionalities: (1) cross-language integration of data mining algorithms, (2) real-time monitoring of system resource consumption, and (3) workload balancing across cluster nodes.
Besides Challenge 1, in advanced manufacturing, the
controlling parameters in the production workflow may cor-
relate with each other, and potentially affect the production
yield rate. Several analysis tasks identified by PDP analysts
include (1) discovering the most related parameters (Task
1); (2) quantifying the parameter correlation with the prod-
uct quality (Task 2); and (3) proposing novel parameter
recipes (i.e., parameter value combinations) to improve the
yield rate (Task 3). A reasonable way to effectively fulfill
these tasks is to utilize suitable data mining and machine
learning techniques. However, existing algorithms cannot
be directly applied to these tasks, as they may either lack
the capability of handling large-scale data, or fail to consider
domain-specific data characteristics.
Challenge 2. Facing various types of mining require-
ments, how to effectively adapt existing algorithms for cus-
tomized analysis tasks that comprehensively consider the do-
main characteristics?
In our proposed system, Challenge 2 is effectively tack-
led by developing appropriate data mining algorithms and
adapting them to the problem of analyzing the manufactur-
ing data. In particular, to address Task 1, we propose an
ensemble feature selection method to generate a stable pa-
rameter set based on the results of various feature selection
methods. To address Task 2, we utilize regression mod-
els to describe the relationship between product quality and
various parameters. To address Task 3, we apply associ-
ation based methods to identify possible feature combina-
tions that can significantly improve the quality of product.
To make the system an integrated solution, we also provide
the functionalities of data exploration (including compara-
tive analysis and data cube) and result management.
Our proposed solution, PDP-Miner, is essentially a scal-
able, easy-to-use and customized data analysis system for
large-scale and complex mining tasks on manufacturing data.
Exploitation of the latest advances in data mining and ma-
chine learning technologies unleashes the potential to achieve
three critical objectives: enhancing exploration and production, improving refining and manufacturing efficiency, and optimizing global operations. Since 2013, PDP-Miner has
been deployed as the production data analysis platform of
COC. By using our system, the overall yield rate has in-
creased from 91% to 94%, which has brought more than 117
million RMB of revenue per year4.
1.3 Roadmap
The rest of the paper is organized as follows. Section 2
presents an overview of our proposed system, starting from
introducing the system architecture, followed by the details
of three interleaved analysis modules, including data explo-
ration, operational analysis and result representation. In
Section 3, we explore possible feature selection strategies to
identify pivotal parameters in the production process, and
propose an ensemble feature selection approach to obtain a robust yet predominant parameter set. In Section 4, we discuss the task of measuring the importance of parameters, and utilize regression models to examine how parameter changes affect the yield rate. Section 5 describes our strategy for mining knowledge from the data, namely employing discriminative analysis (e.g., association mining) to reveal the dependencies among parameters. Section 6 presents the system deployment, describing the system performance evaluation and some important real-world findings. Finally, Section 7 concludes the paper.
2. SYSTEM OVERVIEW
The overall architecture of PDP-Miner is shown in Fig-
ure 3. The system, from bottom to top, consists of two com-
ponents: Data Analytics Platform (including Task Manage-
ment Layer and Physical Resource Layer ) and Data Analysis
Modules.
Data Analytics Platform provides a fast, integrated, and
user-friendly system for data mining in distributed environ-
ment, where all the data analysis tasks accomplished by
Data Analysis Modules are configured as workflows and also
automatically scheduled. Details of this module are provided
in Section 2.1.
Data Analysis Modules provide data-mining solutions and
methodologies to identify important production factors, in-
cluding controlling parameters and their underlying correla-
tions, in order to optimize production process. These meth-
ods are incorporated into the platform as functions and mod-
ules towards specific analysis tasks. In PDP-Miner, there are three major analytic modules: data exploration, data analysis, and result management. In Section 2.2, more details are provided by presenting our data mining solutions customized for PDP production data. A sample system for demonstration purposes is available at http://bigdata-node01.cs.fiu.edu/PDP-Miner/demo.html.
4http://articles.e-works.net.cn/mes/article113579.htm
Figure 3: System Architecture. The Data Analysis Module (Data Exploration: Data Cube, Comparison Analysis; Data Analysis: Parameter Selection, Regression, Parameter Value Recipe; Result Manager: Reporting, Feedback, Visualization) sits on top of the Task Management Layer (Analytic Task Integrator, Analytic Task Manager, Job Scheduler, Job Manager, Resource Manager, Resource Monitor, System Manager) and the Physical Resource Layer (storage resources, database, algorithm library, HDFS, local file system, graphics workstations, standalone computers, computing clusters).
2.1 Data Analytics Platform
Traditional data-mining tools or existing products [10, 21,
19, 18, 23, 30] have three major limitations when applied to
specific industrial sectors or production process analysis: 1) they support neither large-scale data analysis nor handy algorithm plug-ins; 2) they require advanced programming skills when configuring and integrating algorithms for complex data mining tasks; and 3) they do not support a large number of analysis tasks running simultaneously in heterogeneous environments.
To address the limitations of existing products, we develop
the data analytic platform based on our previous large-scale
data mining system, FIU-Miner [32], to facilitate the exe-
cution of data mining tasks. The data analytic platform
provides a set of novel functionalities with the following sig-
nificant advantages [32]:
•Easy operation for task configuration. Users, especially non-data-analysts, can easily configure a complex data mining task by assembling existing algorithms into a workflow. Configuration can be done through a graphical interface. Execution details, including task scheduling and resource management, are transparent to users.
•Flexible support for various programs. Existing data mining tools, such as data preprocessing libraries, can be utilized in the platform. There is no restriction on the programming language of existing or newly implemented programs, since the data analytic platform is capable of distributing tasks to the proper runtime environments.
•Effective resource management. To optimize the uti-
lization of computing resources, tasks are executed by
considering various factors such as algorithm imple-
mentation, server load balance, and the data location.
Various runtime environments are supported for run-
ning data analysis tasks, including graphics worksta-
tions, stand-alone computers, and clusters.
2.2 Data Analysis Modules
2.2.1 Data Exploration
The Comparison Analysis and Data Cube modules assist data analysts in exploring PDP operation data efficiently and effectively.
Comparison Analysis. Comparison Analysis, shown in Figure 6(a), provides a set of tools to help data analysts quickly identify parameters whose values are statistically different between two datasets according to several statistical indicators. It is able to extract the top-k most significant parameters based on predefined indicators or customized ranking criteria. It also supports comparison of the same set of parameters over two different datasets to identify the top-k most representative parameters of the two specified datasets.
Data Cube. Data Cube, shown in Figure 6(b), provides a convenient way to explore high-dimensional data so that data analysts can glance at the characteristics of a dataset. In addition, Data Cube can conduct multi-level inspection of the data by applying OLAP techniques. Data analysts can customize a multi-dimensional cube over the original data. The constructed data cubes allow users to explore multi-dimensional data at different granularities and evaluate the data using pre-defined measurements.
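As a rough illustration of this OLAP-style inspection (with assumed column names, not the system's real schema), the sketch below builds a small cube over hypothetical production records with pandas and aggregates the yield rate at several granularities.

```python
import pandas as pd

# Hypothetical production records; column names are illustrative only.
records = pd.DataFrame({
    "procedure":  ["front", "front", "rear", "rear", "assembly"],
    "routine":    ["ITO",   "ITO",   "BR",   "BR",   "seal"],
    "equipment":  ["E01",   "E02",   "E11",  "E11",  "E21"],
    "yield_rate": [0.93,    0.91,    0.95,   0.89,   0.94],
})

# A two-dimensional cube over (procedure, routine) with yield rate as the measure.
cube = records.pivot_table(index="procedure", columns="routine",
                           values="yield_rate", aggfunc="mean")

# Roll-up to a coarser granularity (per procedure) and drill-down to a finer one
# (procedure, routine, equipment), mimicking OLAP roll-up and drill-down.
rollup = records.groupby("procedure")["yield_rate"].mean()
drilldown = records.groupby(["procedure", "routine", "equipment"])["yield_rate"].mean()
print(cube, rollup, drilldown, sep="\n\n")
```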
2.2.2 Data Analysis
The data mining approaches in the algorithm library can be organized into a configurable procedure in the Operation Panel, as shown in Figure 6(c). The Operation Panel is a unified interface for building a workflow that executes such tasks automatically. The Operation Panel contains the following three main tasks:
Important Parameter Selection. We model the important parameter discovery task as a feature selection problem and implement several feature selection algorithms adapted to the production data. Moreover, an advanced ensemble framework is designed to combine multiple feature selection outputs. Based on these implementations, the system generates a list of important parameters, shown in Figure 6(d).
Regression Analysis. The purpose of Regression Analysis (shown in Figure 6(f)) is to discover the correlations between the yield rate and the controlling parameters. The regression model not only indicates whether a correlation exists between a parameter and the yield rate, but also quantifies the extent to which a change in the parameter value will influence the yield rate.
Discriminative Analysis. Discriminative analysis (see Figure 6(e)) is an alternative approach to identifying the feature values that are strongly indicative of the target labels (panel grades). By grouping and leveraging the features of individual panels, this approach is able to find the rules (sets of features with their values) that are most discriminative with respect to the target labels.
Figure 4: A Sample Workflow for PDP Manufacturing Data Analysis. Workflow 1 (Parameter Selection + Regression Analysis): data are loaded from HDFS, distributed to several feature selection methods (InfoGain, mRMR, ReliefF), and combined by ensemble feature selection into the top-k features, which are used to build a regression model over the important features whose influential features are exported to a database. Workflow 2 (Parameter Selection + Pattern Analysis): the selected features feed frequent feature combination mining with pruning of combinations, and the results are written to the database.
To illustrate how the Data Analysis Modules are incorporated into the Data Analytics Platform, Figure 4 shows two example analytic tasks wrapped as workflows. Workflow 1 is an analysis procedure that builds regression models with selected important parameters; Workflow 2 is another procedure that identifies reasonable parameter value combinations based on previously selected parameters. The Operation Panel provides a user-friendly interface, shown in Figure 5, to facilitate workflow assembly and configuration. Users only need to explicitly specify task dependencies before the workflow is executed automatically by our platform (a conceptual sketch follows Figure 5).
Figure 5: Data Analysis Workflow Configuration.
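Conceptually, a configured workflow is a small directed acyclic graph of named tasks with explicit dependencies, executed in dependency order by the platform. The sketch below is a simplified stand-in, not the actual FIU-Miner/PDP-Miner API; the task names and payloads are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical tasks: each is a callable that receives its upstream results.
def load_data(_):            return "raw production data"
def feature_selection(deps): return f"top-k features from {deps['load_data']}"
def regression(deps):        return f"regression model over {deps['feature_selection']}"
def export_results(deps):    return f"exported: {deps['regression']}"

# A Workflow 1-style chain: load -> select features -> regress -> export.
TASKS = {
    "load_data":         (load_data, []),
    "feature_selection": (feature_selection, ["load_data"]),
    "regression":        (regression, ["feature_selection"]),
    "export_results":    (export_results, ["regression"]),
}

def run_workflow(tasks):
    """Execute tasks in an order consistent with their declared dependencies."""
    order = TopologicalSorter({name: deps for name, (_, deps) in tasks.items()})
    results = {}
    for name in order.static_order():
        func, deps = tasks[name]
        results[name] = func({d: results[d] for d in deps})
    return results

print(run_workflow(TASKS)["export_results"])
```

In the real platform the analogous declaration is made graphically in the Operation Panel, and the scheduler additionally decides where each task runs.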
2.2.3 Result Management
The analytic results are categorized into three types: the important parameter list, the parameter value combinations, and the regression model. Templates are designed to support automatic storage, update, and retrieval of discovered patterns. Results are recorded per analysis task and can be organized in terms of important equipment, top parameters, and task list. For each result, the corresponding domain experts can refine it and give feedback, as shown in Figure 6(h). In addition, visualizations are provided to summarize the analytic results, collected feedback, and the status of the current knowledge (shown in Figure 6(g)). This provides a flexible interface for efficiently maintaining domain knowledge.
3. ENSEMBLE FEATURE SELECTION
In manufacturing management, the primary goal is to im-
prove the yield rate of products by optimizing the manu-
facturing workflow. To this end, one important question is
to identify the key parameters (features) in the workflow,
which can significantly differentiate qualified products from
defective ones. However, it is a non-trivial task to select a
subset of features from the huge feature space. To tackle this problem, we initially experimented with several widely used feature selection approaches. Specifically, we use Information Gain [11], mRMR [5] and ReliefF [24] to perform parameter selection. Figure 7 shows the top 10 features selected by these three algorithms on a sampled PDP dataset.
As observed in Figure 7, the three feature subsets share
only one common feature (“Char 020101-008”). Such a phe-
nomenon indicates the instability of feature selection meth-
ods, as it is difficult to identify the importance of a feature
from a mixed view of feature subsets. In general, the selected features are the most relevant to the labels and the least redundant to each other based on certain criteria. However, correlated features may be ignored if we select only a small subset of features. In terms of knowledge discovery, the selected feature subset is then insufficient to represent important knowledge about redundant features. Further, different algorithms select features based on different criteria, which renders the feature selection result unstable.
Figure 7: Selected Features by Different Algorithms (the top-10 features returned by Information Gain, mRMR, and ReliefF; only one feature, Char_020101-008, appears in all three lists).
The stability issue of feature selection has been studied recently [4, 13] under the assumption of small sample sizes. The results of these works indicate that different algorithms with equal classification performance may vary widely in terms of stability. Another direction of stable feature selection involves exploring the consensus information among different feature groups [17, 29, 31]: consensus feature groups are first identified for a given dataset, and selection is then performed from the feature groups. However, these methods fail to consider the correlation between selected and unselected features, which might be important for guiding feature selection.
In our system, inspired by ensemble clustering [27], we
employ the ensemble strategy on the results of various fea-
ture selection methods to maintain the robustness and sta-
bility of feature selection.

Figure 6: PDP-Miner Analysis Modules: (a) Comparison Analysis, (b) Data Cube, (c) Operation Panel, (d) Parameter Selection, (e) Discriminative Analysis, (f) Regression Analysis, (g) Visualization, (h) Result List and Feedback Collector.

The problem setting of stable feature selection is defined as follows. Given a dataset with $M$ features, we employ $N$ feature selection methods, which for an arbitrary feature $i$ return an $N$-dimensional vector $y_i$, $i = 1, 2, \ldots, M$. Each entry of $y_i$ is 1 or 0, indicating whether feature $i$ is selected by the corresponding feature selection method. Since we are concerned with whether or not to select a feature, we assume that feature $i$, represented by the outputs $y_i$ of the $N$ feature selection methods, is generated from a mixture of two multivariate components, corresponding to selected and unselected features, i.e.,

$$p(y_i) = \sum_{j=1}^{2} \pi_j \, p(y_i \mid \theta_j), \qquad (1)$$
where $\pi_j$ denotes the mixture probability of the $j$-th component, parameterized by $\theta_j$, whose $n$-th entry $\theta_{jn}$ is the probability that the output of the $n$-th feature selection method equals 1. We further assume conditional independence between the feature selection methods. Therefore,

$$p(y_i \mid \theta_j) = \prod_{n=1}^{N} p(y_{in} \mid \theta_{jn}). \qquad (2)$$
As the result of each feature selection method in the vector $y_i$ is either selecting (1) or not selecting (0) feature $i$, the probability that feature $i$ is selected by the $n$-th feature selection method, i.e., $p(y_{in} \mid \theta_{jn})$, can be represented by a Bernoulli distribution:

$$p(y_{in} \mid \theta_{jn}) = \theta_{jn}^{y_{in}} (1 - \theta_{jn})^{1 - y_{in}}. \qquad (3)$$
In addition, we assume that all the features are i.i.d. Then the log-likelihood of the unified probabilistic model is

$$\mathcal{L} = \sum_{i=1}^{M} \log \sum_{j=1}^{2} \pi_j \, p(y_i \mid \theta_j). \qquad (4)$$
To learn the parameters $\pi_j$ and $\theta_j$, $j \in \{1, 2\}$, we use the Expectation-Maximization (EM) algorithm. To this end, we introduce a series of hidden random variables $z_i$, $i = 1, \ldots, M$, to indicate the membership of $y_i$ in each component; each $z_i$ has entries $z_{i1}$ and $z_{i2}$ with $z_{i1} + z_{i2} = 1$.
The iterative EM procedure terminates when the change in the likelihood of the mixture model falls below a predefined threshold. The hidden variable $z_i$ indicates the membership probabilities of feature $y_i$ with respect to all mixture components, similar to the situation in Gaussian mixture models. Each feature is assigned to the component $j$ whose corresponding value $z_{ij}$ is the largest, $j \in \{1, 2\}$. As a feature selection method eventually generates two subsets of features (selected or not), it is reasonable to use two mixture components.
After obtaining the assignments of features to components, say $\phi(z_i)$, we group features into two categories, i.e., selected and unselected. In practice, the number of selected features is significantly smaller than the number of unselected ones, and hence the features that are not selected by any feature selection method fall into a large category. The features in the other category are the final feature selection results. Specifically, for each component $j$, we pick the features whose membership assignment $z_{ij}$ is greater than a predefined threshold $\tau$ and put them into the selected category. In this way, we can discard features with low selection probabilities, and the stability of feature selection is achieved by assembling different feature selection results using the mixture model.
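The sketch below is a minimal illustration of the two-component Bernoulli mixture in Eqs. (1)-(4) trained with a basic EM loop; it is not the production implementation, and the initialization, convergence tolerance, and threshold τ are arbitrary choices.

```python
import numpy as np

def ensemble_feature_selection(Y, tau=0.9, tol=1e-6, max_iter=200, seed=0):
    """Two-component Bernoulli mixture over binary selection vectors.

    Y: (M, N) 0/1 matrix; Y[i, n] = 1 if feature i was selected by method n.
    Returns indices of features assigned to the 'selected' component with
    responsibility greater than tau.
    """
    rng = np.random.default_rng(seed)
    M, N = Y.shape
    pi = np.array([0.5, 0.5])                     # mixture weights
    theta = rng.uniform(0.25, 0.75, size=(2, N))  # Bernoulli parameters per component
    prev_ll = -np.inf

    for _ in range(max_iter):
        # E-step: responsibilities z[i, j] ∝ pi_j * prod_n Bernoulli(Y[i, n]; theta[j, n])
        log_p = (Y[:, None, :] * np.log(theta[None] + 1e-12)
                 + (1 - Y[:, None, :]) * np.log(1 - theta[None] + 1e-12)).sum(axis=2)
        log_w = np.log(pi + 1e-12) + log_p
        log_norm = np.logaddexp(log_w[:, 0], log_w[:, 1])
        z = np.exp(log_w - log_norm[:, None])

        # M-step: update mixture weights and Bernoulli parameters.
        pi = z.mean(axis=0)
        theta = (z.T @ Y) / (z.sum(axis=0)[:, None] + 1e-12)

        # Terminate when the log-likelihood (Eq. 4) stops improving.
        ll = log_norm.sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    # Treat the component with the higher average selection probability as 'selected'.
    selected_comp = int(theta.mean(axis=1).argmax())
    return np.where(z[:, selected_comp] > tau)[0]

# Example: 6 features scored by 3 feature selection methods (rows: features).
Y = np.array([[1, 1, 1], [1, 1, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0], [1, 0, 1]])
print(ensemble_feature_selection(Y, tau=0.8))
```

In practice, Y would be built from the 0/1 outputs of the configured selectors, e.g., Information Gain, mRMR, and ReliefF.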
4. REGRESSION ANALYSIS
To optimize the production process, it is imperative to
discover the parameters that have significant influence on
the yield rate and quantify such influence. In our system,
an actionable solution is to explicitly establish a relationship
between controlling parameters and the yield rate, which can
be achieved using regression analysis.
Formally, assume the daily observations are i.i.d. Then the relationship between the features (parameters) and the yield rate can be modeled as a function $f(\mathbf{x}, \mathbf{w})$ with additive Gaussian noise, i.e.,

$$y = f(\mathbf{x}, \mathbf{w}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \beta^{-1}), \qquad (5)$$

where $y$ denotes the yield rate, $\mathbf{x} = (x_1, \cdots, x_d)^T$ denotes the set of features that may have an impact on $y$, and $\mathbf{w}$ denotes the feature weights. The noise term $\epsilon$ is a zero-mean Gaussian random variable with precision $\beta$.
In our system, we implement two linear regression based models: ridge regression and lasso regression [12]. From the perspective of maximum likelihood, the linear relationship can be expressed as

$$\ln p(\mathbf{y} \mid \mathbf{w}, \beta) = \sum_i \ln \mathcal{N}(y_i \mid \mathbf{w}^T \mathbf{x}_i, \beta^{-1}). \qquad (6)$$

For both models, we leverage the least squares error, i.e.,

$$E(\mathbf{w}) = \begin{cases} \frac{1}{2}\sum_i (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \frac{1}{2}\lambda \lVert \mathbf{w} \rVert_2^2, & \text{ridge regression} \\ \frac{1}{2}\sum_i (y_i - \mathbf{w}^T \mathbf{x}_i)^2 + \frac{1}{2}\lambda \lVert \mathbf{w} \rVert_1, & \text{lasso regression.} \end{cases} \qquad (7)$$
In the advanced manufacturing domain, the number of features is usually large (in the PDP scenario, more than 10K), and therefore ensemble feature selection (described in Section 3) is applied before building the regression model. To conduct the regression, we incorporate three categories of features:
1. The parameters of the equipment involved in the manufacturing process. This category of features is collected from the equipment logs.
2. The parameters of the environment, such as temperature, humidity, and pressure. This category of features is collected from the sensors deployed in each workshop.
3. The features of the materials, such as viscosity, consistency, and concentration. This category of features is collected from material descriptions and reports.
After integrating all the features, we normalize each dimension using standardization, i.e., $\frac{x - \bar{X}}{\mathrm{std}(X)}$.
The linear regressions can be solved efficiently. When the dataset is small, closed forms can be obtained directly, i.e., $\hat{\mathbf{w}}_{\mathrm{ridge}} = (\lambda I + X^T X)^{-1} X^T \mathbf{y}$ for ridge regression and $\hat{\mathbf{w}}_{\mathrm{lasso}} = \mathrm{sgn}\big((X^T X)^{-1} X^T \mathbf{y}\big)\big(|(X^T X)^{-1} X^T \mathbf{y}| - \lambda\big)$ for lasso regression, where $X$ denotes the feature matrix whose $i$-th row is the feature vector $\mathbf{x}_i$. For large datasets, we train the models iteratively using stochastic gradient descent for ridge regression and coordinate descent for lasso regression.
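For a concrete sense of how the fits in Eq. (7) are obtained in practice, the sketch below standardizes the features and trains ridge and lasso models with scikit-learn on synthetic data; the regularization strength and the toy data are placeholders, not values from the production system.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

def fit_yield_models(X, y, lam=0.001):
    """Fit ridge and lasso regressions of the yield rate on selected parameters.

    X: (n_samples, n_features) matrix of equipment, environment, and material
    features that survived ensemble feature selection; y: daily yield rates.
    lam plays the role of the regularization strength lambda in Eq. (7).
    """
    X_std = StandardScaler().fit_transform(X)   # (x - mean) / std, as in the text
    ridge = Ridge(alpha=lam).fit(X_std, y)
    lasso = Lasso(alpha=lam).fit(X_std, y)
    return ridge.coef_, lasso.coef_

# Toy example with synthetic data: only feature 0 influences the yield rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 0.92 + 0.01 * X[:, 0] + rng.normal(scale=0.003, size=200)
w_ridge, w_lasso = fit_yield_models(X, y)
print("ridge weights:", np.round(w_ridge, 4))
print("lasso weights:", np.round(w_lasso, 4))   # lasso shrinks irrelevant weights to 0
```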
The weights of the trained model can be intuitively interpreted. First, the value $|w_i|$ indicates the conditional correlation between feature $x_i$ and the yield rate given the other features; in general, a larger weight indicates a larger conditional correlation. Moreover, the corresponding p-value of each feature can be leveraged to measure the reliability of the correlation: the smaller the p-value, the less likely the correlation is spurious.
By performing regression analysis on the PDP data, we
find some interesting correlations. For example:
1. The variance of the air humidity has a positive correlation with the yield rate. This provides empirical evidence to support the conjecture of PDP technicians that the variance of the humidity plays an important role in affecting the yield rate.
2. The air pressure has a positive correlation with the yield rate, whereas its variance correlates inversely: the less the pressure changes, the higher the yield rate.
3. The workshop temperature and its variance vary only slightly within a small range, and the corresponding weight is very small. In practice, a change in temperature may affect the usage of materials as well as the production process. Hence, it is carefully controlled by technicians.
5. DISCRIMINATIVE ANALYSIS
Discriminative analysis mines the feature knowledge of the
PDP panel data from a different perspective. It is used
as an alternative way to reveal the underlying relationship
between the features and the panel grades. Specifically, it
helps discover parameter recipes as well as sets of feature
values which are closely related to qualified panels and de-
fective panels. In PDP-Miner, the techniques of association
based classification [16] and low-support discriminative pat-
tern mining [6] are leveraged to conduct the discriminative
analysis.
5.1 Association based classification
Association based classification integrates classification and association rule mining to discover a special subset of association rules called class association rules (CARs). A CAR is an implication of the form $r: F \rightarrow y$, where $F$ is a subset of the entire feature value set and $y$ denotes the class label (the PDP panel grade in our scenario). Each CAR is associated with a support $s$ and a confidence $c$, indicating how many records contain $F$ and the ratio of records containing $F$ that are labeled as $y$. In general, CARs contain strong discriminative information to infer PDP panel grades. A rule-based classifier can be built by selecting a subset of the CARs, $r_1, r_2, \ldots, r_n \rightarrow y$, that collectively causes the least error.
Compared with feature selection and regression analy-
sis, association based classification enables the possibility of
early detection due to the unique characteristics of CARs. If a CAR only refers to features from the early stages of the manufacturing process, this method can quickly identify defective semi-finished panels and prevent further resource waste. The early detection strategy is valuable in the advanced manufacturing domain, as every defective semi-finished product detected early directly reduces the manufacturing cost. For production with a large number of assembly procedures, such a reduction is nontrivial.
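The sketch below is a minimal illustration of CAR mining over discretized (parameter, value) items; for brevity it enumerates only short antecedents and uses hypothetical support/confidence thresholds, rather than the full rule mining machinery of [16].

```python
from itertools import combinations
from collections import Counter

def mine_cars(records, labels, min_support=0.05, min_confidence=0.8, max_len=2):
    """Mine class association rules F -> y over (parameter, value) items.

    records: list of dicts mapping parameter name -> discretized value.
    labels:  list of class labels (e.g., panel grades) aligned with records.
    Returns (antecedent, label, support, confidence) tuples.
    """
    n = len(records)
    transactions = [frozenset(r.items()) for r in records]
    items = {item for t in transactions for item in t}
    rules = []
    # Enumerate candidate antecedents of size 1..max_len (a simplification of Apriori).
    for size in range(1, max_len + 1):
        for antecedent in combinations(sorted(items), size):
            ante = frozenset(antecedent)
            covered = [lbl for t, lbl in zip(transactions, labels) if ante <= t]
            support = len(covered) / n
            if support < min_support:
                continue
            label, count = Counter(covered).most_common(1)[0]
            confidence = count / len(covered)
            if confidence >= min_confidence:
                rules.append((dict(antecedent), label, support, confidence))
    return rules

# Toy example with hypothetical discretized parameters and panel grades.
recs = [{"p14": 0, "p15": 24}, {"p14": 0, "p15": 24}, {"p14": 1, "p15": 3}]
grades = ["SCRAP", "SCRAP", "GOOD"]
for rule in mine_cars(recs, grades, min_support=0.3, min_confidence=0.9):
    print(rule)
```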
5.2 Low support discriminative pattern mining
A manufacturing process can consist of hundreds of assembly procedures with thousands of tuning parameters. When the feature dimension is high, standard association rule based methods become time-consuming. A naïve solution in this scenario is to increase the support threshold to speed up mining. However, this strategy may miss interesting low-support patterns.
To address this problem, we adapt the idea of the low-support discriminative pattern mining algorithm (SMP) [6] and integrate the algorithm into PDP-Miner. SMP mines discriminative patterns by leveraging a family of anti-monotonic measures called SupMaxK. SupMaxK organizes the discriminative pattern set into nested layers of subsets.
5.2.1 Discriminative Patterns Detection
Many association mining methods utilize "support" to select rules/patterns. Different from traditional association mining, the "discriminative support" is defined to measure the quality (discriminative capability) of a rule set:

$$\mathrm{DisS}(\alpha) = |S_{\mathrm{qualified}}(\alpha) - S_{\mathrm{defective}}(\alpha)|, \qquad (8)$$

where $\alpha$ is a set of parameter values, and $S_{\mathrm{qualified}}$ and $S_{\mathrm{defective}}$ denote the support of $\alpha$ over the two classes, indicating whether the target panel is qualified or defective.

A naïve implementation using this measure suffers from low efficiency [6] when pruning frequent non-discriminative rules. To address this issue, a new measure, $\mathrm{SupMaxK}(\alpha)$, is introduced to help prune unrelated patterns by estimating $S_{\mathrm{defective}}(\beta)$:

$$\mathrm{SupMaxK}(\alpha) = S_{\mathrm{qualified}}(\alpha) - \max_{\beta \subseteq \alpha,\, |\beta| = K} S_{\mathrm{defective}}(\beta), \qquad (9)$$

where $\beta$ is a subset of $\alpha$ with $|\beta| = K$. Three reasons make this measure useful: (1) SupMaxK helps select more discriminative patterns as $K$ increases; (2) SupMaxK is a lower bound of DisS; (3) SupMaxK is anti-monotonic.

Due to the anti-monotonic property of SupMaxK, SMP can naturally be utilized to mine discriminative patterns whose support is low but which are strongly indicative of the panel grades.
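As a small self-contained illustration (not the SMP implementation of [6]), the sketch below computes DisS (Eq. 8) and SupMaxK (Eq. 9) for a candidate pattern over two sets of transactions; the toy items mirror the discretized (parameter, value) pairs used elsewhere in this section.

```python
from itertools import combinations

def support(pattern, transactions):
    """Fraction of transactions containing every item of the pattern."""
    pattern = frozenset(pattern)
    return sum(pattern <= t for t in transactions) / len(transactions)

def dis_s(pattern, qualified, defective):
    """Discriminative support, Eq. (8)."""
    return abs(support(pattern, qualified) - support(pattern, defective))

def sup_max_k(pattern, qualified, defective, k=2):
    """SupMaxK, Eq. (9): an anti-monotonic lower bound of DisS."""
    max_sub = max(support(beta, defective)
                  for beta in combinations(pattern, min(k, len(pattern))))
    return support(pattern, qualified) - max_sub

# Toy transactions of (parameter, value) items for qualified/defective panels.
good = [frozenset({("p14", 1), ("p15", 3)}), frozenset({("p14", 1), ("p43", 44)})]
scrap = [frozenset({("p14", 0), ("p15", 24)}), frozenset({("p14", 0), ("p43", 48)})]
alpha = {("p14", 1), ("p43", 44)}
print(dis_s(alpha, good, scrap), sup_max_k(alpha, good, scrap, k=2))
```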
6. SYSTEM DEPLOYMENT
We evaluate our proposed system from two aspects: the
system performance and the real findings. The evaluation
demonstrates that our system is a practical solution for
large-scale data analysis, through integrating and adapting
classic data mining techniques and customizing them for spe-
cific domains, particularly, advanced manufacturing.
6.1 System performance
Our system is able to perform large-scale data analysis and can be easily scaled up. To demonstrate the scalability
of PDP-Miner, we design a series of cluster workload balance
experiments in both static and dynamic computing environ-
ments. The experiments are conducted on a testbed clus-
ter separated from the real production system. The cluster
consists of 8 computing nodes with different computing per-
formances.
In the experiments, one frequent analysis task of PDP-Miner is created using the job configuration interface, consisting of two sequential functions, i.e., Parameter Selection → Parameter Combination Extraction. For evaluation purposes, ten different datasets (about 30 million records) are generated by sampling from the original one-year production datasets. The analysis task is conducted over these datasets in two types of experiments: Exp I, workload balance in a static environment, and Exp II, workload balance in a dynamic environment. In the following, we describe the detailed experimental plans as well as the results.
Exp I: Each node in the cluster is deployed with one
Worker. We configure 10 parameter selection tasks with dif-
ferent running times in PDP-Miner. Each job starts at time 0
and repeats with a random interval (< 1 minute). Figure 8
shows how our system balances the workloads based on the
underlying infrastructures. The x-axis denotes the time and
the y-axis denotes the average number of completed jobs for
each Worker at the given moment during the task execution.
Clearly, the accumulated number of completed jobs (the blue
solid bars) increases linearly, whereas the amortized number
of completed jobs (the white empty bars) remains stable.
This shows that when the cluster remains unchanged, our
system achieves a good balance of resource utilization by properly distributing jobs. The effective distribution of jobs guarantees full use of existing resources to maximize throughput without incurring resource bottlenecks.
Figure 8: Load Balance in Static Environment.

Exp II: To investigate the resource utilization of PDP-Miner under a dynamic environment, we initially provide four nodes (node1∼4), each with 1 Worker, and then add the other four nodes (node5∼8) 10 minutes later. To emulate nodes with different computing powers, the newly-added nodes are deployed with 2 to 5 Workers, respectively. Each Worker is restricted to use only 1 CPU core at a time, so a node deployed with more Workers has more powerful computing resources. Figure 9 shows the number of jobs completed by each node during 70 minutes of observed system execution. The number of jobs on each
node is segmented every 10 minutes. It clearly shows that
the number of completed jobs is proportional to the number
of Workers on each node, which indicates that our system
can balance the workloads in a dynamically changing cluster. It also demonstrates that the entire system can be linearly scaled with resources of different computing power.
Figure 9: Load Balance in Dynamic Environment.
6.2 Real Findings
PDP-Miner has been playing an important role in revealing deeper and finer relations behind big data in COC's real practice. As an example, WorkFlow1 in Figure 4 is executed to extract important parameters from a single procedure, named barrier-rib (BR). Thirty selected parameters are reported and verified by domain experts. Among these 30 parameters, 15 had already been carefully monitored by the analysts, which is consistent with domain knowledge. Another 9 parameters, which were not monitored in previous production, are confirmed to have great impact on product quality. After applying WorkFlow1 to the entire production data, 197 important parameters are reported by our system, among which 133 parameters are consistent with production experience and 50 parameters are verified by domain experts to have a direct impact on product quality. The details are shown in Figure 11 (blue portion: consistent with domain expertise; red portion: confirmed to be important but previously ignored; white portion: excluded after verification).
To discover meaningful parameter values, WorkFlow2 in Figure 4 is used. We separate the production data into two sets by product quality (GOOD, i.e., qualified products, and SCRAP, i.e., defective products) and execute WorkFlow2 on the two sets, respectively. The analysis generates hundreds of frequent parameter value combinations for each given dataset (the number of outputs can be restricted by empirically setting a confidence threshold). By extracting the frequent combinations in SCRAP that are not frequent in GOOD, we can obtain the value combinations that may result in defective products. Figure 12 shows a verification of a sample combination <para-xxxx-014=0, para-xxxx-015=0 or 24, para-xxxx-043=44 or 48> (big red crosses indicate that the values appear densely on SCRAP products). Such a parameter value combination should be avoided in production practice.

Figure 10: Real Case of Regression Analysis Results.
Figure 11: Important Parameters Discovered (133 consistent with domain expertise, 50 important but previously ignored, 14 excluded as not relevant after verification).
Figure 12: A Sample Parameter Combination (para-xxxx-014, para-xxxx-015, para-xxxx-043).
By applying regression analysis in WorkFlow1 of Figure 4,
we discovered that environmental parameters, such as tem-
perature and humidity, have significant correlations with the
product quality. Further analysis confirmed that when the
surrounding temperature of the BR Furnace is under 27°C, the
number of defective products with BR Open or BR Short
increases dramatically. Figure 10 depicts such findings.
The aforementioned findings are some typical examples
obtained from the practical usage of our proposed system.
Most of our findings have been validated by PDP technicians
and are incorporated into their operational manual.
6.3 Deployment Practice
PDP-Miner has been successfully applied to ChangHong COC's PDP production lines for the 3rd and 4th generations of products for manufacturing optimization. Every time the product line is upgraded, the yield rate drops significantly, since previous parameter settings cannot match the requirements of the new products. The earlier the parameters are properly tuned, the greater the cost reduction. PDP-Miner has been intensively used in such situations for problem diagnosis, including quickly identifying problematic parameter settings, detecting abnormal parameter values, and monitoring sensitive parameters.
In summary, our system brings several great benefits in
optimizing the production process:
•Through establishing the relationship between param-
eter settings and product quality, manufacturers are
more confident to properly control the production pro-
cess based on analytical evidence. The cost has been
greatly reduced as the number of defective products
decreases.
•The prompt analysis of the production data enables
the quick diagnosis on parameter values, especially
when upgrading the assembly line or handling unex-
pected faults. As a result, the throughput increases.
•A knowledge database is constructed to manage useful
analytic results that have been verified and validated
by existing domain expertise. Technicians can refer to
the database to look for possible solutions and control
the assembly line more efficiently.
By taking advantage of our system, the overall PDP yield
rate increases from 91% to 94%. Monthly production capac-
ity is boosted by 10,000 panels, which brings more than 117
million RMB of revenue improvement per year5. Our system plays a revolutionary role and can be naturally transferred to other flat panel industries, such as Liquid Crystal Display (LCD) and Organic Light-Emitting Diode (OLED) panels, to generate great social and economic benefits.
7. CONCLUSION
PDP-Miner has been deployed as an important supplementary component since 2013. It enables prompt data analysis and efficient knowledge discovery in advanced manufacturing processes. The improved production efficacy shows that a practical data-driven solution that considers both system flexibility and algorithm customization can fill the application gap between manufacturers and data analysts. We firmly believe that, if properly applied, data analytics will become a dominant factor that underpins new waves of productivity growth and innovation and fundamentally transforms manufacturing across industries.
5http://articles.e-works.net.cn/mes/article113579.htm.
8. REFERENCES
[1] R Belz and P Mertens. Combining knowledge-based
systems and simulation to solve rescheduling problems.
Decision Support Systems, 17(2):141–157, 1996.
[2] Injazz J Chen. Planning for erp systems: analysis and
future trend. Business process management journal,
7(5):374–386, 2001.
[3] Wei-Chou Chen, Shian-Shyong Tseng, and Ching-Yao
Wang. A novel manufacturing defect detection method
using association rule mining techniques. Expert
systems with applications, 29(4):807–815, 2005.
[4] Chad A Davis, Fabian Gerick, Volker Hintermair,
Caroline C Friedel, Katrin Fundel, Robert Küffner,
and Ralf Zimmer. Reliable gene signatures for
microarray classification: assessment of stability and
performance. Bioinformatics, 22(19):2356–2363, 2006.
[5] Chris Ding and Hanchuan Peng. Minimum
redundancy feature selection from microarray gene
expression data. Journal of bioinformatics and
computational biology, 3(02):185–205, 2005.
[6] Gang Fang, Gaurav Pandey, Wen Wang, Manish
Gupta, Michael Steinbach, and Vipin Kumar. Mining
low-support discriminative patterns from dense and
high-dimensional data. TKDE, 24(2):279–294, 2012.
[7] C. Gröger, Florian Niedermann, Holger Schwarz, and
Bernhard Mitschang. Supporting manufacturing
design by analytics, continuous collaborative process
improvement enabled by the advanced manufacturing
analytics platform. In CSCWD, pages 793–799. IEEE,
2012.
[8] Christoph Gröger, Florian Niedermann, and Bernhard
Mitschang. Data mining-driven manufacturing process
optimization. In Proceedings of the World Congress on
Engineering, volume 3, pages 4–6, 2012.
[9] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard
Pfahringer, Peter Reutemann, and Ian H Witten. The
weka data mining software: an update. ACM
SIGKDD explorations newsletter, 11(1):10–18, 2009.
[10] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard
Pfahringer, Peter Reutemann, and Ian H. Witten. The
weka data mining software: An update. SIGKDD
Explorations, 2009.
[11] Jiawei Han, Micheline Kamber, and Jian Pei. Data
mining: concepts and techniques. Morgan Kaufmann,
2006.
[12] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning, volume 2. Springer, 2009.
[13] Alexandros Kalousis, Julien Prados, and Melanie
Hilario. Stability of feature selection algorithms: a
study on high-dimensional spaces. Knowledge and
information systems, 12(1):95–116, 2007.
[14] Jürgen Kletti. Manufacturing Execution Systems (MES). Springer, 2007.
[15] David Lei, Michael A Hitt, and Joel D Goldhar.
Advanced manufacturing technology: organizational
design and strategic flexibility. Organization Studies,
17(3):501–523, 1996.
[16] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating
classification and association rule mining. In SIGKDD,
1998.
[17] Steven Loscalzo, Lei Yu, and Chris Ding. Consensus
group stable feature selection. In SIGKDD, pages
567–576. ACM, 2009.
[18] MILK. http://pythonhosted.org/milk.
[19] MLC++. http://www.sgi.com/tech/mlc.
[20] Sewon Oh, Jooyung Han, and Hyunbo Cho. Intelligent
process control system for quality improvement by
data mining in the process industry. In Data mining
for design and manufacturing, pages 289–309.
Springer, 2001.
[21] Sean Owen, Robin Anil, Ted Dunning, and Ellen
Friedman. Mahout in Action. Manning, 2011.
[22] Rajagopal Palaniswamy and Tyler Frank. Enhancing
manufacturing performance with erp systems.
Information systems management, 17(3):43–55, 2000.
[23] Zoltan Prekopcsak, Gabor Makrai, Tamas Henk, and
Csaba Gaspar-Papanek. Radoop: Analyzing big data
with rapidminer and hadoop. In RCOMM, 2011.
[24] Marko Robnik-Šikonja and Igor Kononenko.
Theoretical and empirical analysis of relieff and
rrelieff. Machine learning, 53(1-2):23–69, 2003.
[25] Lixiang Shen, Francis EH Tay, Liangsheng Qu, and
Yudi Shen. Fault diagnosis using rough sets theory.
Computers in Industry, 43(1):61–72, 2000.
[26] Victor A Skormin, Vladimir I Gorodetski, and
Leonard J Popyack. Data mining technology for failure
prognostic of avionics. TAES, 38(2):388–403, 2002.
[27] Alexander Topchy, Anil K Jain, and William Punch.
A mixture model of clustering ensembles. In SDM,
pages 379–390, 2004.
[28] Toby D Wall, J Martin Corbett, Robin Martin,
Chris W Clegg, and Paul R Jackson. Advanced
manufacturing technology, work design, and
performance: A change study. Journal of Applied
Psychology, 75(6):691, 1990.
[29] Adam Woznica, Phong Nguyen, and Alexandros
Kalousis. Model mining for robust feature selection. In
SIGKDD, pages 913–921. ACM, 2012.
[30] Le Yu, Jian Zheng, Bin Wu, Bai Wang, Chongwei
Shen, Long Qian, and Renbo Zhang. Bc-pdm: Data
mining, social network analysis and text mining
system based on cloud computing. In SIGKDD, 2012.
[31] Lei Yu, Chris Ding, and Steven Loscalzo. Stable
feature selection via dense feature groups. In
SIGKDD, pages 803–811. ACM, 2008.
[32] Chunqiu Zeng, Yexi Jiang, Li Zheng, Jingxuan Li, Lei
Li, Hongtai Li, Chao Shen, Wubai Zhou, Tao Li, Bing
Duan, Ming Lei, and Pengnian Wang. FIU-Miner: A
Fast, Integrated, and User-Friendly System for Data
Mining in Distributed Environment. In SIGKDD,
2013.