Applying Data Mining Techniques to Address Critical
Process Optimization Needs in Advanced Manufacturing
Li Zheng1, Chunqiu Zeng1, Lei Li1, Yexi Jiang1, Wei Xue1, Jingxuan Li1, Chao Shen1,
Wubai Zhou1, Hongtai Li1, Liang Tang1, Tao Li1, Bing Duan2, Ming Lei2and Pengnian Wang2
1School of Computer Science, Florida International University, Miami, FL, USA 33174
2ChangHong COC Display Devices Co., Ltd, Mianyang, Sichuan, China 621000
ABSTRACT
Advanced manufacturing sectors such as aerospace, semiconductors, and flat display devices often involve complex production processes and generate large volumes of production data. In general, the production data comes from products with different levels of quality, assembly lines with complex flows and equipment, and processing crafts with massive numbers of controlling parameters. The scale and complexity of the data are beyond the analytic power of traditional IT infrastructures. To achieve better manufacturing performance, it is imperative to explore the underlying dependencies in the production data and exploit analytic insights to improve the production process. However, few research or industrial efforts have been reported on providing manufacturers with integrated data analytical solutions that reveal potentials and optimize the production process from a data-driven perspective.
In this paper, we design, implement, and deploy an integrated solution, named PDP-Miner, a data analytics platform customized for process optimization in Plasma Display Panel (PDP) manufacturing. The system utilizes the latest advances in data mining technologies and Big Data infrastructures to create a complete analytical solution. In addition, our proposed system is capable of automatically configuring and scheduling analysis tasks and balancing heterogeneous computing resources. The system and the analytic strategies can be applied to other advanced manufacturing fields to enable complex data analysis tasks. Since 2013, PDP-Miner has been deployed as the data analysis platform of ChangHong COC1. By taking advantage of our system, the overall PDP yield rate has increased from 91% to 94%. Monthly production has been boosted by 10,000 panels, which brings more than 117 million RMB of additional revenue per year2.
1ChangHong COC Display Devices Co., Ltd is one of the world's largest display device manufacturing companies, based in China.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
KDD ’14 New York City, New York USA
Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.
Categories and Subject Descriptors: H.2.8[Database
Applications]: Data Mining; H.4[Information Systems Ap-
plications]: Miscellaneous
Keywords: Advanced Manufacturing, Big Data, Data Min-
ing Platform, Process Optimization
1. INTRODUCTION
The manufacturing industry involves the production of merchandise for use or sale using labor, machines, tools, chemical processing, and so on. It has been the mainstay of many developed economies and remains an important driver of GDP (Gross Domestic Product). According to Bureau of Economic Analysis data, every dollar of goods in manufacturing generates $1.48 in economic activity, the highest economic multiplier among major economic sectors3. With the advancement of new technologies, many manufacturers utilize cutting-edge materials and emerging capabilities enabled by the physical, biological, chemical, and computer sciences. Such improved manufacturing processes are often referred to as "advanced manufacturing" [15, 28]. For example, organizations in the oil and gas industry apply new technologies to transform raw data into actionable insight to improve asset value and product yield while enhancing safety and protecting the environment.
In advanced manufacturing, a medium-sized or large manufacturing sector often arranges complex and elaborate production processes according to the product structure, and generates large volumes of production data collected by sensor technologies [8], Manufacturing Execution Systems (MES) [14], and Enterprise Resource Planning (ERP) [22]. In practice, the production data contains intricate dependencies among a tremendous number of controlling parameters in the production workflow. Generally, it is extremely difficult or even impossible for analysts to manually explore such dependencies, let alone propose strategies to potentially optimize the workflow.
Fortunately, the use of data analytics offers manufacturers great opportunities to acquire actionable insights toward optimizing the production workflow. However, in practice, there is a significant application gap between manufacturers and data analysts in observing the data and using automation tools. Table 1 highlights the perspective differences between manufacturers and analysts on three important aspects: (1) Capacity, i.e., what the data looks like; (2) Capability, i.e., how the data can be utilized; and (3) Knowledge, i.e., how to perform knowledge discovery and management.

3JEC Democratic staff calculations based on data from the Bureau of Economic Analysis, Industry Data, Input-Output Accounts, Industry-by-Industry Total Requirements after Redefinitions (1998 to 2011).

Table 1: Perspective Differences Between Manufacturers and Data Analysts.
Manufacturers
  Capacity: huge production output; sophisticated workflow; complex supply chain
  Capability: control yield rate; optimize production line; effective parameter setting
  Knowledge: private know-how; high dependency on experts; high cost of testing
Data Analysts
  Capacity: large number of samples; high-dimensional data; complex parameter dependencies
  Capability: process optimization; feature reduction and selection; feature association analysis
  Knowledge: utilize domain expertise; knowledge sharing; knowledge management
Bridging the application gap requires tools that: utilize customized data analysis algorithms to mine the underlying knowledge; provide configurable task platforms to allow automatic taskflow execution; and enable efficient knowledge representation and management.
To bridge the gap, it is imperative to provide automated
tools to the manufacturers to enhance their capability of an-
alyzing production data. Data analytics in advanced man-
ufacturing, especially data mining approaches, have been
targeting several important fields, such as product qual-
ity analysis [20, 26], failure analysis of production [3, 25],
production planning and scheduling analysis [1, 2], analytic
platform implementation [7, 8], etc. However, few research
and industrial efforts have been reported on providing man-
ufacturers with an integrated data analytical platform,
to enable automatic analysis of the production data and ef-
ficient optimization of the production process. Our goal is
to provide such a solution to practically fill the gap.
1.1 A Concrete Case: PDP Manufacturing
Plasma Display Panel (PDP) manufacturing produces over 10,000 panels per day at ChangHong COC Display Devices Co., Ltd (COC for short). The production line is nearly 6,000 meters long, and the process contains 75 assembly routines and 279 major pieces of production equipment with more than 10,000 parameters. The average production time throughout the manufacturing process is 76 hours. Specifically, the workflow consists of three major procedures shown in Figure 1, i.e., front panel, rear panel, and panel assembly. Each procedure contains multiple sequentially executed flows, and each flow is composed of multiple key routines. The first two procedures are executed in parallel, and each pair of front and rear panels is assembled in the assembly procedure. Figure 2 depicts the real assembly line of one routine (tin-doped indium oxide, ITO) in the front panel procedure, which gives a sense of how complex the complete production process is.
Figure 1: PDP Manufacturing Production Flow (front panel, rear panel, and panel assembly processing; 3 major procedures, 75 assembly routines, 279 major pieces of equipment, over 10,000 parameters, a 6,000 m production line, and 76 hours of processing time).
Figure 2: An Example Routine in the PDP Workflow.
There are 83 types of equipment in the PDP manufacturing process, each of which has a different set of parameters to fulfill the corresponding processing tasks. The parameters are often preset to certain values to ensure the normal operation of each piece of equipment. However, the observed parameter values often deviate from the preset values. Furthermore, in the production environment, external factors, e.g., temperature, humidity, and atmospheric pressure, may potentially affect the product quality, as the raw materials and equipment are sensitive to these factors. The observed values of the external factors vary significantly in terms of sensor locations and acquisition time. The production process generates a huge amount of production data (10 gigabytes, or about 30 million records, per day).
In daily operations, manufacturers are concerned with how to improve the yield rate of the production. To achieve this goal, several questions need to be carefully addressed:
- What are the key parameters whose values can significantly differentiate qualified products from defective ones?
- How do changes in parameter values affect the production?
- What are the effective parameter recipes to ensure a high yield rate?
Answering these questions, however, is a non-trivial task due to the scale and complexity of the production data, and it is impossible for domain analysts to manually explore the data. Hence, it is necessary to automate the optimization process using appropriate infrastructural and algorithmic solutions.
1.2 Challenges and Proposed Solutions
The massive production data poses great challenges to manufacturers in effectively optimizing the production workflow. Over the past two years, we have been working closely with the technicians and engineers from COC to investigate data-driven techniques for improving the production yield rate. During this process, we have identified two key challenges and propose a solution to each, as follows.
In general, a highly automated production process generates a large volume of data containing a myriad of controlling parameters with the corresponding observed values. The parameters may have malformed or missing values due to inaccurate sensing or transmission. Therefore, it is crucial to efficiently store and preprocess these data in order to handle their increasing scale as well as their incompleteness. In addition, the analysis of the production data is a cognitive activity toward the production workflow, which embodies an iterative process of exploring the data, analyzing the data, and representing the insights. A practical system should provide an integrated and high-efficiency solution to support this process.
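The paper does not publish its preprocessing pipeline; a minimal pandas sketch of handling malformed and missing parameter readings might look as follows (the function name, the 50% missing-value cutoff, and median imputation are illustrative choices, not the system's actual rules).

```python
import pandas as pd

def preprocess_readings(df, numeric_cols):
    """Clean raw parameter readings: coerce malformed entries to NaN,
    drop mostly-missing parameters, and impute the remaining gaps."""
    out = df.copy()
    for col in numeric_cols:
        # Malformed sensor strings (e.g. 'ERR', '--') become NaN.
        out[col] = pd.to_numeric(out[col], errors="coerce")
    # Discard parameters with more than 50% missing readings.
    keep = [c for c in numeric_cols if out[c].isna().mean() <= 0.5]
    out = out[keep]
    # Impute remaining gaps with the per-parameter median.
    return out.fillna(out.median())

raw = pd.DataFrame({
    "temp":     ["21.5", "ERR", "22.1", "21.9"],
    "pressure": ["--", "--", "--", "101.3"],
})
clean = preprocess_readings(raw, ["temp", "pressure"])
```

Here `pressure` is dropped (75% of its readings are missing) while the single bad `temp` reading is replaced by the column median.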
Challenge 1. Facing enormous data with sustained growth, how can we efficiently support large-scale data analysis tasks and provide prompt guidance to the different routines in the workflow?
Existing data mining products, such as Weka [9], SPSS, and SQL Server Data Tools, provide functionalities to facilitate users in conducting analyses. However, these products are designed for small- or medium-scale data analysis, and hence cannot be applied to our problem setting. To address Challenge 1, we design and implement an integrated
Data Analytics Platform based on a distributed system [32] to support high-performance analysis. The platform manages all the production data in a distributed environment and is capable of configuring and executing data preprocessing and data analysis tasks automatically. The platform has the following functionalities: (1) cross-language data mining algorithm integration, (2) real-time monitoring of system resource consumption, and (3) workload balancing across cluster nodes.
Besides Challenge 1, in advanced manufacturing, the
controlling parameters in the production workflow may cor-
relate with each other, and potentially affect the production
yield rate. Several analysis tasks identified by PDP analysts
include (1) discovering the most related parameters (Task
1); (2) quantifying the parameter correlation with the prod-
uct quality (Task 2); and (3) proposing novel parameter
recipes (i.e., parameter value combinations) to improve the
yield rate (Task 3). A reasonable way to effectively fulfill
these tasks is to utilize suitable data mining and machine
learning techniques. However, existing algorithms cannot
be directly applied to these tasks, as they may either lack
the capability of handling large-scale data, or fail to consider
domain-specific data characteristics.
Challenge 2. Facing various types of mining requirements, how can we effectively adapt existing algorithms to customized analysis tasks that comprehensively consider the domain characteristics?
In our proposed system, Challenge 2 is effectively tack-
led by developing appropriate data mining algorithms and
adapting them to the problem of analyzing the manufactur-
ing data. In particular, to address Task 1, we propose an
ensemble feature selection method to generate a stable pa-
rameter set based on the results of various feature selection
methods. To address Task 2, we utilize regression mod-
els to describe the relationship between product quality and
various parameters. To address Task 3, we apply associ-
ation based methods to identify possible feature combina-
tions that can significantly improve the quality of product.
To make the system an integrated solution, we also provide
the functionalities of data exploration (including compara-
tive analysis and data cube) and result management.
Our proposed solution, PDP-Miner, is essentially a scalable, easy-to-use, and customized data analysis system for large-scale and complex mining tasks on manufacturing data. Exploiting the latest advances in data mining and machine learning technologies unleashes the potential to achieve three critical objectives: enhancing exploration and production, improving refining and manufacturing efficiency, and optimizing global operations. Since 2013, PDP-Miner has been deployed as the production data analysis platform of COC. By using our system, the overall yield rate has increased from 91% to 94%, which has brought more than 117 million RMB of additional revenue per year4.
1.3 Roadmap
The rest of the paper is organized as follows. Section 2 presents an overview of our proposed system, starting with the system architecture, followed by the details of three interleaved analysis modules: data exploration, operational analysis, and result representation. In Section 3, we explore possible feature selection strategies to identify pivotal parameters in the production process, and propose an ensemble feature selection approach to obtain a robust yet predominant parameter set. In Section 4, we discuss the task of measuring the importance of parameters, and utilize regression models to examine how parameter changes affect the yield rate. Section 5 describes our strategy of mining knowledge from the data, that is, employing discriminative analysis (e.g., association mining) to reveal the dependencies among parameters. Section 6 presents the system deployment, in which the system performance evaluation is described and some important real findings are presented. Finally, Section 7 concludes the paper.
2. SYSTEM OVERVIEW
The overall architecture of PDP-Miner is shown in Figure 3. The system, from bottom to top, consists of two components: the Data Analytics Platform (including the Task Management Layer and the Physical Resource Layer) and the Data Analysis Modules.
The Data Analytics Platform provides a fast, integrated, and user-friendly system for data mining in a distributed environment, where all the data analysis tasks accomplished by the Data Analysis Modules are configured as workflows and automatically scheduled. Details of this module are provided in Section 2.1.
The Data Analysis Modules provide data mining solutions and methodologies to identify important production factors, including controlling parameters and their underlying correlations, in order to optimize the production process. These methods are incorporated into the platform as functions and modules for specific analysis tasks. In PDP-Miner, there are three major analytic modules: data exploration, data analysis, and result management. In Section 2.2, more details are provided by presenting our data mining solutions customized for PDP production data. A sample system for demonstration purposes is available at http://bigdata-node01.
Figure 3: System Architecture. The Data Analysis Modules (Data Exploration, Data Analysis, Result Manager) sit on top of the Task Management Layer (Analytic Task Integrator, Analytic Task Manager, Resource Monitor, System Manager) and the Physical Resource Layer (resources such as databases and local files).
2.1 Data Analytics Platform
Traditional data mining tools and existing products [10, 21, 19, 18, 23, 30] have three major limitations when applied to specific industrial sectors or production process analysis: 1) they support neither large-scale data analysis nor convenient algorithm plug-in; 2) they require advanced programming skills for configuring and integrating algorithms for complex data mining tasks; and 3) they do not support large numbers of analysis tasks running simultaneously in heterogeneous environments.
To address the limitations of existing products, we develop
the data analytic platform based on our previous large-scale
data mining system, FIU-Miner [32], to facilitate the exe-
cution of data mining tasks. The data analytic platform
provides a set of novel functionalities with the following sig-
nificant advantages [32]:
Easy operation for task configuration. Users, especially non-data-analysts, can easily configure a complex data mining task by assembling existing algorithms into a workflow. Configuration can be done through a graphical interface. Execution details, including task scheduling and resource management, are transparent to users.
Flexible support for various programs. Existing data mining tools, such as data preprocessing libraries, can be utilized in this platform. There is no restriction on the programming languages of programs, whether existing or to be implemented, since our data analytic platform is capable of distributing the tasks to the proper runtime environments.
Effective resource management. To optimize the uti-
lization of computing resources, tasks are executed by
considering various factors such as algorithm imple-
mentation, server load balance, and the data location.
Various runtime environments are supported for run-
ning data analysis tasks, including graphics worksta-
tions, stand-alone computers, and clusters.
2.2 Data Analysis Modules
2.2.1 Data Exploration
The Comparison Analysis and Data Cube tools assist data analysts in exploring PDP operation data efficiently and effectively.
Comparison Analysis. Comparison Analysis, shown in Figure 6(a), provides a set of tools to help data analysts quickly identify parameters whose values are statistically different between two datasets according to several statistical indicators. Comparison Analysis is able to extract the top-k most significant parameters based on predefined indicators or customized ranking criteria. It also supports comparison of the same set of parameters over two different datasets to identify the top-k most representative parameters of the two specified datasets.
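The specific statistical indicators are not listed here; as a sketch of the top-k idea, one simple indicator (absolute mean difference over the pooled standard deviation) can rank shared parameters like this (function name, indicator choice, and data are illustrative).

```python
import numpy as np

def top_k_different(params_a, params_b, k=2):
    """Rank the parameters shared by two datasets with one simple
    indicator: absolute mean difference over the pooled std. dev."""
    scores = {}
    for name in params_a.keys() & params_b.keys():
        a, b = np.asarray(params_a[name]), np.asarray(params_b[name])
        pooled = np.sqrt((a.var() + b.var()) / 2) + 1e-12  # avoid div by 0
        scores[name] = abs(a.mean() - b.mean()) / pooled
    return sorted(scores, key=scores.get, reverse=True)[:k]

# 'temp' differs between qualified and defective panels; 'flow' does not.
qualified = {"temp": [200, 201, 199, 200], "flow": [5.0, 5.1, 4.9, 5.0]}
defective = {"temp": [205, 206, 204, 205], "flow": [5.0, 5.1, 4.9, 5.0]}
print(top_k_different(qualified, defective, k=1))
```

A production version would swap in the system's own predefined indicators or a customized ranking criterion.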
Data Cube. Data Cube, shown in Figure 6(b), provides a convenient approach to exploring high-dimensional data so that data analysts can glance at the characteristics of the dataset. In addition, Data Cube can conduct multi-level inspection of the data by applying OLAP techniques. Data analysts can customize a multi-dimensional cube over the original data. The constructed data cubes thus allow users to explore multi-dimensional data at different granularities and evaluate the data using pre-defined measurements.
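As a rough illustration of the OLAP-style cube (not the system's actual implementation), a yield-rate cube over two toy dimensions can be built with pandas, where `margins=True` supplies the coarser roll-up granularities.

```python
import pandas as pd

# Toy production log: one row per panel with two categorical dimensions
# and a quality flag (1 = qualified).
log = pd.DataFrame({
    "routine":   ["ITO", "ITO", "Sealing", "Sealing", "ITO", "Sealing"],
    "shift":     ["day", "night", "day", "night", "day", "day"],
    "qualified": [1, 0, 1, 1, 1, 0],
})

# A two-dimensional cube: yield rate by routine x shift, with 'All'
# margins giving the roll-up along each dimension.
cube = pd.pivot_table(log, values="qualified", index="routine",
                      columns="shift", aggfunc="mean", margins=True)
print(cube)
```

Each cell is the mean of the quality flag, i.e., the yield rate at that granularity; the `All`/`All` cell is the overall rate.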
2.2.2 Data Analysis
The data mining approaches in the algorithm library can be organized into a configurable procedure in the Operation Panel, as shown in Figure 6(c). The Operation Panel is a unified interface for building a workflow that executes such tasks automatically. It covers the following three main tasks:
Important Parameter Selection. By modeling the important parameter discovery task as a feature selection problem, several feature selection algorithms are adapted to the production data. Moreover, an advanced ensemble framework is designed to combine multiple feature selection outputs. Based on these implementations, the system is able to generate a list of important parameters, shown in Figure 6(d).
Regression Analysis. The purpose of Regression Analysis (shown in Figure 6(f)) is to discover the correlations between the yield rate and the controlling parameters. The regression model not only indicates whether a correlation exists between a parameter and the yield rate but also quantifies the extent to which a change in the parameter value will influence the yield rate.
Discriminative Analysis. Discriminative analysis (see Figure 6(e)) is an alternative approach to identifying the feature values that are strong indicators of the target labels (panel grades). By grouping and leveraging the features of individual panels, this approach is able to find the rules (sets of features with their values) that are most discriminative with respect to the target labels according to the data.
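The exact rule-mining algorithm is not specified at this point; the following sketch illustrates the idea for single feature=value conditions only, scoring each condition by its confidence toward a target grade (all names, data, and the threshold are illustrative).

```python
from collections import Counter

def discriminative_rules(records, labels, target, min_conf=0.8):
    """Find single feature=value conditions whose confidence toward the
    target label exceeds min_conf (a one-item sketch of rule mining)."""
    cond_counts, hit_counts = Counter(), Counter()
    for rec, lab in zip(records, labels):
        for item in rec.items():           # item = (feature, value)
            cond_counts[item] += 1
            if lab == target:
                hit_counts[item] += 1
    return {item: hit_counts[item] / n
            for item, n in cond_counts.items()
            if hit_counts[item] / n >= min_conf}

records = [{"voltage": "high", "gas": "A"},
           {"voltage": "high", "gas": "B"},
           {"voltage": "low",  "gas": "A"},
           {"voltage": "low",  "gas": "B"}]
labels = ["defective", "defective", "ok", "ok"]
rules = discriminative_rules(records, labels, target="defective")
```

Here only `voltage=high` reaches full confidence toward the defective grade; a real miner would extend this to multi-item conditions with support pruning.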
Figure 4: A Sample Workflow for PDP Manufacturing Data Analysis (Workflow 1: Parameter Selection + Regression Analysis; Workflow 2: Parameter Selection + Pattern Analysis, i.e., feature combination mining on the top-k important features).
To illustrate how the Data Analysis Modules are incorporated with the Data Analytics Platform, Figure 4 shows two example analytic tasks wrapped as two workflows. As shown, Workflow 1 is an analysis procedure that builds regression models with selected important parameters; Workflow 2 is another procedure that identifies reasonable parameter value combinations based on previously selected parameters. The Operation Panel provides a user-friendly interface, shown in Figure 5, to facilitate workflow assembly and configuration. Users only need to explicitly specify task dependencies before the workflow is executed automatically by our platform.
Figure 5: Data Analysis Workflow Configuration.
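A toy sketch of this declare-dependencies-then-run idea, using Python's standard `graphlib` to derive the scheduling order (the task names and bodies are stand-ins, not the platform's API):

```python
from graphlib import TopologicalSorter

def run_workflow(tasks, deps):
    """Execute analysis tasks in dependency order: users declare only
    the task dependencies; the scheduling order is derived for them."""
    order, results = [], {}
    for name in TopologicalSorter(deps).static_order():
        results[name] = tasks[name](results)
        order.append(name)
    return order, results

# A chain in the spirit of Workflow 1: selection feeding regression.
tasks = {
    "load":    lambda r: list(range(10)),
    "select":  lambda r: r["load"][:3],       # keep top-k parameters
    "regress": lambda r: sum(r["select"]),    # stand-in for model fitting
}
deps = {"select": {"load"}, "regress": {"select"}}
order, results = run_workflow(tasks, deps)
```

The real platform additionally handles resource-aware scheduling and cross-language programs; this sketch shows only the dependency-driven execution order.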
2.2.3 Result Management
The analytic results are being categorized into three
types: the important parameter list, the parameter value
combinations, and the regression model. Templates are de-
signed to support automatically storage, update, and re-
trieval of discovered patterns. Results are recorded based on
analysis tasks and can be organized in terms of important
equipment, top parameters, and task list. For each result,
corresponding domain experts can refine and give feedback,
shown in Figure 6(h). In addition, visualizations are pro-
vided to summarize the analytic results, collected feedbacks,
and status of current knowledge (shown in Figure 6(g)). It
provides a flexible interface for maintaining domain knowl-
edge very efficiently.
3. ENSEMBLE FEATURE SELECTION
In manufacturing management, the primary goal is to improve the yield rate of products by optimizing the manufacturing workflow. To this end, one important question is how to identify the key parameters (features) in the workflow that can significantly differentiate qualified products from defective ones. However, it is a non-trivial task to select a subset of features from the huge feature space. To tackle this problem, we initially experimented with several widely used feature selection approaches. Specifically, we use Information Gain [11], mRMR [5], and ReliefF [24] to perform parameter selection. Figure 7 shows the top 10 features selected by these three algorithms on a sampled PDP dataset.
As observed in Figure 7, the three feature subsets share only one common feature ("Char 020101-008"). Such a phenomenon indicates the instability of feature selection methods, as it is difficult to identify the importance of a feature from a mixed view of feature subsets. In general, the selected features are the most relevant to the labels and the least redundant to each other based on certain criteria. However, correlated features may be ignored if we select only a small subset of features. In terms of knowledge discovery, the selected feature subset is then insufficient to represent important knowledge about the redundant features. Further, different algorithms select features based on different criteria, which renders the feature selection results unstable.
Figure 7: Selected Features by Different Algorithms.
The stability issue of feature selection has been studied recently [4, 13] under the assumption of small sample sizes. The results of these works indicate that different algorithms with equal classification performance may vary widely in terms of stability. Another direction of stable feature selection involves exploring the consensus information among different feature groups [17, 29, 31]: first identify consensus feature groups for a given dataset, and then perform selection from the feature groups. However, these methods fail to consider the correlation between selected and unselected features, which might be important in guiding feature selection.
Figure 6: PDP-Miner Analysis Modules ((a) Comparison Analysis, (b) Data Cube, (c) Operation Panel, (d) Parameter Selection, (e) Discriminative Analysis, (f) Regression Analysis, (g) Visualization, (h) Result List and Feedback Collector).

In our system, inspired by ensemble clustering [27], we employ an ensemble strategy on the results of various feature selection methods to maintain the robustness and stability of feature selection. The problem setting of stable feature selection is defined as follows. Given a dataset with $M$ features, we employ $N$ feature selection methods, which for an arbitrary feature $i$ return an $N$-length vector $\mathbf{y}_i$, $i = 1, 2, \ldots, M$. Each entry of $\mathbf{y}_i$ is 1 or 0, indicating whether feature $i$ is selected by the corresponding feature selection method. Since we are concerned with whether or not to select a feature, we assume that a feature $i$, in the form of the results of the $N$ feature selection methods, $\mathbf{y}_i$, is generated from a mixture of two multivariate components, indicating selected and unselected features, i.e.,
$$p(\mathbf{y}_i) = \sum_{j=1}^{2} \pi_j \, p(\mathbf{y}_i \mid \boldsymbol{\theta}_j), \qquad (1)$$
where $\pi_j$ denotes the mixture probability of the $j$-th component parameterized by $\boldsymbol{\theta}_j$, in which the $n$-th entry $\theta_{jn}$ is the probability that the output of the $n$-th feature selection method equals 1. We further assume conditional independence between feature selection methods. Therefore,
$$p(\mathbf{y}_i \mid \boldsymbol{\theta}_j) = \prod_{n=1}^{N} p(y_{in} \mid \theta_{jn}). \qquad (2)$$
As the result of a feature selection method in the vector $\mathbf{y}_i$ is either selecting (1) or not selecting (0) feature $i$, the probability of feature $i$ being selected by the $n$-th feature selection method, i.e., $p(y_{in} \mid \theta_{jn})$, can be represented by a Bernoulli distribution:
$$p(y_{in} \mid \theta_{jn}) = \theta_{jn}^{y_{in}} (1 - \theta_{jn})^{1 - y_{in}}. \qquad (3)$$
In addition, we assume that all the features are i.i.d. The log-likelihood of the unified probabilistic model is then
$$\ln p(Y) = \sum_{i=1}^{M} \ln \sum_{j=1}^{2} \pi_j \, p(\mathbf{y}_i \mid \boldsymbol{\theta}_j). \qquad (4)$$
To learn the parameters $\pi_j$ and $\boldsymbol{\theta}_j$, $j \in \{1, 2\}$, we use the Expectation-Maximization (EM) algorithm. To this end, we introduce hidden random variables $\mathbf{z}_i$, $i = 1, \ldots, M$, to indicate the membership of $\mathbf{y}_i$ in each component, with entries $z_{i1}, z_{i2}$ satisfying $z_{i1} + z_{i2} = 1$.
The iterative EM procedure terminates when the likelihood of the mixture model changes by less than a predefined threshold. The hidden variable $\mathbf{z}_i$ indicates the probabilities of membership of feature $\mathbf{y}_i$ with respect to all mixture components, similar in spirit to Gaussian mixture models. Feature $i$ is assigned to the $j$-th component for which the corresponding value $z_{ij}$ is the largest, $j \in \{1, 2\}$. As a feature selection method eventually generates two subsets of features (selected or not), it is reasonable to use two mixture components.
After obtaining the assignments of features to components, say $\phi(\mathbf{z}_i)$, we group the features into two categories, i.e., selected and unselected. In practice, the number of selected features is significantly smaller than the number of unselected ones, and hence the features that are not selected by any feature selection method fall into one large category. The features in the other category form the final feature selection results. Specifically, for each component $j$, we pick the features whose membership assignment $z_{ij}$ is greater than a predefined threshold $\tau$, and put these features into the selected category. In this way, we can discard features with low selection probabilities, and the stability of feature selection is achieved by assembling different feature selection results with the mixture model.
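The EM procedure above can be condensed into a short numpy sketch; the initialization, the threshold tau, and the convergence tolerance are illustrative choices, and the "selected" component is identified as the one with the larger average Bernoulli parameter.

```python
import numpy as np

def ensemble_select(Y, n_iter=100, tau=0.6, tol=1e-6):
    """EM for a two-component multivariate Bernoulli mixture over the
    M x N binary matrix Y (row i: was feature i picked by method n?).
    Returns indices of features whose responsibility for the
    'selected' component exceeds tau."""
    M, N = Y.shape
    pi = np.array([0.5, 0.5])               # mixture weights pi_j
    theta = np.vstack([np.full(N, 0.7),     # deterministic init:
                       np.full(N, 0.3)])    # component 0 leans 'selected'
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities z[i, j] from eqs. (1)-(3).
        logp = (Y @ np.log(theta.T)
                + (1 - Y) @ np.log(1 - theta.T)
                + np.log(pi))               # shape (M, 2)
        norm = np.logaddexp(logp[:, 0], logp[:, 1])
        z = np.exp(logp - norm[:, None])
        # M-step: update mixture weights and Bernoulli parameters.
        pi = z.mean(axis=0)
        theta = np.clip((z.T @ Y) / z.sum(axis=0)[:, None],
                        1e-6, 1 - 1e-6)
        if norm.sum() - prev < tol:         # log-likelihood converged
            break
        prev = norm.sum()
    sel = np.argmax(theta.mean(axis=1))     # the 'selected' component
    return np.where(z[:, sel] > tau)[0]

# 3 feature selection methods, 6 features; features 0 and 1 are picked
# by (almost) every method, the rest (almost) never.
Y = np.array([[1, 1, 1],
              [1, 1, 0],
              [0, 0, 0],
              [0, 0, 0],
              [0, 1, 0],
              [0, 0, 0]], dtype=float)
picked = ensemble_select(Y)
```

On this toy input, features 0 and 1 end up in the selected component, while feature 4, picked by only one method, is discarded as low-probability.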
4. REGRESSION ANALYSIS
To optimize the production process, it is imperative to discover the parameters that have significant influence on the yield rate and to quantify such influence. In our system, an actionable solution is to explicitly establish a relationship between the controlling parameters and the yield rate, which can be achieved using regression analysis.
Formally, assume the daily observations are i.i.d. Then the relationship between the features (parameters) and the yield rate can be modeled as a function $f(\mathbf{x}, \mathbf{w})$ with additive Gaussian noise, i.e.,
$$y = f(\mathbf{x}, \mathbf{w}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \beta^{-1}), \qquad (5)$$
where $y$ denotes the yield rate, $\mathbf{x} = (x_1, \cdots, x_d)^T$ denotes the set of features that may have an impact on $y$, and $\mathbf{w}$ denotes the weights of the features. The noise term $\epsilon$ is a zero-mean Gaussian random variable with precision $\beta$.
In our system, we implement two linear regression based
models: ridge regression and lasso regression [12]. From the
perspective of maximum likelihood, the linear relationship can be expressed as

ln p(y|w, β) = Σ_i ln N(y_i | w^T x_i, β⁻¹).  (6)

For both models, we use penalized least squares to quantify the error, i.e.,

E(w) = (1/2) Σ_i (y_i − w^T x_i)² + (λ/2) ||w||₂²  (ridge regression),
E(w) = (1/2) Σ_i (y_i − w^T x_i)² + λ ||w||₁  (lasso regression).  (7)
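The ridge and lasso error functions can be transcribed directly into code. A minimal numpy sketch (the tiny data set and λ below are purely illustrative):

```python
# The two penalized least-squares errors, transcribed directly
# (numpy only; the data and lambda are illustrative).
import numpy as np

def ridge_error(w, X, y, lam):
    # E(w) = 1/2 * sum_i (y_i - w^T x_i)^2 + lam/2 * ||w||_2^2
    r = y - X @ w
    return 0.5 * r @ r + 0.5 * lam * w @ w

def lasso_error(w, X, y, lam):
    # E(w) = 1/2 * sum_i (y_i - w^T x_i)^2 + lam * ||w||_1
    r = y - X @ w
    return 0.5 * r @ r + lam * np.abs(w).sum()

X = np.eye(2)
y = np.array([1.0, 0.0])
w = np.array([1.0, 0.0])
print(ridge_error(w, X, y, 1.0))  # 0.5 (zero residual, penalty 1/2*||w||_2^2)
print(lasso_error(w, X, y, 1.0))  # 1.0 (zero residual, penalty ||w||_1)
```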
In advanced manufacturing domain, the number of fea-
tures is usually large (in PDP scenario, the number of fea-
tures is more than 10K), and therefore ensemble feature se-
lection (described in Section 3) is applied before building the
regression model. To conduct the regression, we incorporate
three categories of features:
1. The parameters of the equipment involved in the manufacturing process. This category of features is collected from the equipment logs.
2. The parameters of the environment, such as temperature, humidity, and pressure. This category of features is collected from the sensors deployed in each workshop.
3. The features of the materials, such as viscosity, consistency, and concentration. This category of features is collected from material descriptions and records.

After integrating all the features, we normalize each dimension using standardization, i.e., x ← (x − x̄)/σ, where x̄ and σ are the mean and standard deviation of that dimension.
The linear regression can be solved efficiently. When the dataset is small, the closed-form solutions can be obtained directly, i.e., ŵ_ridge = (λI + X^T X)⁻¹ X^T y for ridge regression, and ŵ_lasso = sgn((X^T X)⁻¹ X^T y)(|(X^T X)⁻¹ X^T y| − λ)₊ for lasso regression (the soft-thresholding form, exact for orthonormal designs), where X denotes the feature matrix whose i-th row is x_i^T. For large datasets, we train the model iteratively, using stochastic gradient descent for ridge regression and coordinate descent for lasso regression.
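The closed forms can be checked numerically on synthetic data. The sketch below (numpy only; data, weights, and λ are invented for the example) computes the ridge solution and the soft-thresholded lasso solution; note the soft-thresholding shortcut is exact only for (near-)orthogonal designs, so in general coordinate descent would be used instead.

```python
# Numerical check of the closed-form solutions on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=n)

lam = 0.1
# Ridge: w = (lambda*I + X^T X)^{-1} X^T y
w_ridge = np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

# Lasso: soft-threshold the OLS solution (X^T X)^{-1} X^T y at lambda
# (valid when X^T X is close to orthogonal, as here)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)
w_lasso = np.sign(w_ols) * np.maximum(np.abs(w_ols) - lam, 0.0)

print(w_ridge.round(2))  # close to w_true
print(w_lasso.round(2))  # irrelevant features 2 and 3 are zeroed out
```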
The weights of the trained model can be intuitively interpreted. First, the value |w_i| indicates the conditional correlation between the feature x_i and the yield rate given the other features; in general, a larger weight indicates a stronger conditional correlation. Moreover, the p-value of each feature can be leveraged to measure the reliability of the correlation: the smaller the p-value, the less likely the correlation is spurious.
By performing regression analysis on the PDP data, we
find some interesting correlations. For example:
1. The variance of the humidity of the air has a positive correlation with the yield rate. This provides empirical evidence for the conjecture of PDP technicians that the variance of the humidity plays an important role in affecting the yield rate.
2. The pressure of the air has a positive correlation with the yield rate, whereas its variance correlates inversely: the less the pressure changes, the higher the yield rate.
3. The workshop temperature and its variance vary only slightly within a small range, and the corresponding weight is very small. In practice, a change of temperature may affect the usage of materials as well as the production process; hence, it is carefully controlled by technicians.
5. Discriminative Analysis

Discriminative analysis mines the feature knowledge of the PDP panel data from a different perspective. It is used as an alternative way to reveal the underlying relationship between the features and the panel grades. Specifically, it helps discover parameter recipes, i.e., sets of feature values that are closely related to qualified panels and defective panels. In PDP-Miner, the techniques of association based classification [16] and low-support discriminative pattern mining [6] are leveraged to conduct the discriminative analysis.
5.1 Association based classification
Association based classification integrates classification and association rule mining to discover a special subset of association rules called class association rules (CARs). A CAR is an implication of the form {r: F ⇒ y}, where F is a subset of the entire feature-value set and y denotes the class label (the PDP panel grade in our scenario). Each CAR is associated with a support s and a confidence c, indicating how many records contain F and the ratio of the records containing F that are labeled as y, respectively. In general, CARs contain strong discriminative information for inferring the PDP panel grades. A rule-based classifier can be built by selecting a subset of the CARs, {r_1, r_2, ..., r_n} ⇒ y, that collectively causes the least error.
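To make the CAR notion concrete, the toy sketch below mines CARs by exhaustive enumeration. The parameter names, data, and thresholds are invented for illustration, and a real rule miner would use Apriori/FP-growth-style search rather than brute force.

```python
# Toy class-association-rule (CAR) miner by exhaustive enumeration
# (illustrative data and thresholds; not the system's implementation).
from itertools import combinations
from collections import Counter

def mine_cars(records, labels, min_sup=2, min_conf=0.8, max_len=2):
    """records: list of sets of feature-value items; labels: class of
    each record.  Returns CARs (F, y, support, confidence) with
    support(F) >= min_sup and confidence(F -> y) >= min_conf."""
    cars = []
    items = sorted({i for r in records for i in r})
    for k in range(1, max_len + 1):
        for F in combinations(items, k):
            Fset = set(F)
            covered = [l for r, l in zip(records, labels) if Fset <= r]
            if len(covered) < min_sup:
                continue  # support too low
            y, cnt = Counter(covered).most_common(1)[0]
            conf = cnt / len(covered)
            if conf >= min_conf:
                cars.append((F, y, len(covered), conf))
    return cars

# Hypothetical parameter-value items and panel grades:
records = [{"p14=0", "p15=24"}, {"p14=0", "p15=24"},
           {"p14=1", "p15=0"}, {"p14=1", "p15=24"}]
labels = ["SCRAP", "SCRAP", "GOOD", "GOOD"]
for F, y, s, c in mine_cars(records, labels):
    print(set(F), "->", y, "support:", s, "confidence:", c)
```

On this toy data, {p14=0} ⇒ SCRAP and {p14=1} ⇒ GOOD emerge as high-confidence rules, which is exactly the kind of early-detection signal the text describes.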
Compared with feature selection and regression analy-
sis, association based classification enables the possibility of
early detection due to the unique characteristics of CARs. If
CARs only refer to the features in the early manufacturing
process, this method can quickly identify semi-finished yet
defective panels, and prevent further resource waste.
The early detection strategy is useful in the advanced
manufacturing domain, as any earlier detected bad semi-
finished product can directly reduce the manufacturing cost.
For a production process with a large number of assembly procedures, such a reduction is far from trivial.
5.2 Low-support discriminative pattern mining
A manufacturing process could consist of hundreds of as-
sembling procedures with thousands of tuning parameters. When the feature dimension is high, standard association-rule-based methods become time-consuming. A naïve solution in this scenario is to increase the support threshold to speed up the mining. However, this strategy may miss interesting low-support patterns.
To address this problem, we adapt the low-support discriminative pattern mining algorithm (SMP) [6] and integrate it into PDP-Miner. SMP mines discriminative patterns by leveraging a family of anti-monotonic measures called SupMaxK, which organizes the discriminative pattern set into nested layers of subsets.
5.2.1 Discriminative Patterns Detection
Many association mining methods utilize "support" to select rules/patterns. Different from traditional association mining, a "discriminative support" is defined to measure the quality (discriminative capability) of a rule set:

DisS(α) = |S_qualified(α) − S_defective(α)|,  (8)

where α is a set of parameter values, and S_qualified(α) and S_defective(α) denote the support of α over the two classes, i.e., over qualified and defective panels, respectively.
A naïve implementation using this measure suffers from low efficiency [6] when pruning frequent non-discriminative rules. To address this issue, a new measure, SupMaxK(α), is introduced to help prune unrelated patterns by estimating S_defective(β):

SupMaxK(α) = S_qualified(α) − max_{β ⊆ α, |β| = K} S_defective(β),

where β ranges over the size-K subsets of α. Three properties make this measure useful: (1) SupMaxK selects more discriminative patterns as K increases; (2) SupMaxK is a lower bound of DisS; (3) SupMaxK is anti-monotonic.
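The two measures can be transcribed directly; the toy Python below does so (the transaction sets and K are illustrative assumptions, and supports are computed as fractions):

```python
# Direct transcription of the DisS and SupMaxK measures
# (toy data; not the SMP algorithm itself).
from itertools import combinations

def support(itemset, transactions):
    s = set(itemset)
    return sum(1 for t in transactions if s <= t) / len(transactions)

def dis_s(alpha, qualified, defective):
    # DisS(alpha) = |S_qualified(alpha) - S_defective(alpha)|
    return abs(support(alpha, qualified) - support(alpha, defective))

def sup_max_k(alpha, qualified, defective, K=1):
    # SupMaxK(alpha) = S_qualified(alpha)
    #   - max over size-K subsets beta of alpha of S_defective(beta).
    # Anti-monotonic, and a lower bound of the discriminative support.
    worst = max(support(b, defective) for b in combinations(alpha, K))
    return support(alpha, qualified) - worst

qualified = [{"a", "b", "c"}, {"a", "b"}, {"a"}, {"b"}]
defective = [{"a"}, {"b"}, {"c"}, {"c"}]
alpha = ("a", "b")
print(dis_s(alpha, qualified, defective))      # 0.5
print(sup_max_k(alpha, qualified, defective))  # 0.25, a lower bound of DisS
```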
Due to the anti-monotonic property of SupMaxK, SMP can naturally be utilized to mine discriminative patterns whose support is low but which strongly indicate the panel grades.
6. Evaluation
We evaluate our proposed system from two aspects: the
system performance and the real findings. The evaluation
demonstrates that our system is a practical solution for
large-scale data analysis, through integrating and adapting
classic data mining techniques and customizing them for spe-
cific domains, particularly, advanced manufacturing.
6.1 System performance
Our system is able to perform large-scale data analysis and can easily be scaled up. To demonstrate the scalability of PDP-Miner, we design a series of cluster workload-balance experiments in both static and dynamic computing environments. The experiments are conducted on a testbed cluster separated from the real production system. The cluster consists of 8 computing nodes with different computing performance.
In the experiments, one frequent analysis task of PDP-Miner is created using the job configuration interface, which consists of two sequential functions, i.e., Parameter Selection → Parameter Combination Extraction. For evaluation purposes, ten different datasets (about 30 million records) are generated by sampling from the original one-year production datasets. The analysis task is conducted over these datasets in two types of experiments: Exp I, workload balance in a static environment, and Exp II, workload balance in a dynamic environment. In the following, we describe the detailed experimental plans as well as the results.
Exp I: Each node in the cluster is deployed with one
Worker. We configure 10 parameter selection tasks with dif-
ferent running times in PDP-Miner. Each job starts at time 0
and repeats with a random interval (< 1 minute). Figure 8
shows how our system balances the workloads based on the
underlying infrastructures. The x-axis denotes the time and
the y-axis denotes the average number of completed jobs for
each Worker at the given moment during the task execution.
Clearly, the accumulated number of completed jobs (the blue
solid bars) increases linearly, whereas the amortized number
of completed jobs (the white empty bars) remains stable.
This shows that when the cluster remains unchanged, our
system achieves a good balance of the resource utilization
by properly distributing jobs. The effective distribution of
jobs guarantees a full use of existing resources to maximize
the throughput without incurring resource bottleneck.
Figure 8: Load Balance in Static Environment.

Exp II: To investigate the resource utilization of PDP-Miner under a dynamic environment, we initially provide four nodes (node1–4), each with 1 Worker, and then add the other four nodes (node5–8) 10 minutes later. To emulate nodes with different computing power, the newly added nodes are deployed with 2 to 5 Workers, respectively. Each Worker is restricted to use only 1 CPU core at a time, so a node deployed with more Workers has more computing power at its disposal. Figure 9 shows the number of jobs completed by each node during 70 minutes of observed system execution. The number of jobs on each
node is segmented every 10 minutes. The figure clearly shows that the number of completed jobs is proportional to the number of Workers on each node, which indicates that our system can balance the workloads in a dynamically changing cluster. It also demonstrates that the entire system scales linearly with resources of different computing power.
Figure 9: Load Balance in Dynamic Environment.
6.2 Real Findings
PDP-Miner has been playing an important role in revealing deeper and finer relations behind big data in COC's real
practice. As an example, WorkFlow1 in Figure 4 is executed
to extract important parameters from a single procedure,
named barrier-rib (BR). 30 selected parameters are reported and verified by domain experts. Among these 30 parameters, 15 have already been carefully monitored by the analysts, which is consistent with domain knowledge. Another 9 parameters, which were not monitored in previous production, are confirmed to have great impact on the
product quality. After applying WorkFlow1 to the entire
production data, 197 important parameters are reported by
our system, among which 133 parameters are consistent with
production experience, and 50 parameters are verified by do-
main experts to have direct impact on the product quality.
The details are shown in Figure 11 (blue portion: consistent with domain expertise; red portion: confirmed to be important but previously ignored; white portion: excluded after verification).
To discover meaningful parameter values, WorkFlow2 in
Figure 4 is used.

Figure 10: Real Case of Regression Analysis Results.

We separate the production data into two sets by product quality (GOOD, i.e., qualified products, and SCRAP, i.e., defective products) and execute WorkFlow2 on these two sets, respectively. The analysis generates hundreds of frequent parameter-value combinations for each dataset (the number of outputs can be restricted by empirically setting a confidence threshold). By extracting the combinations that are frequent in SCRAP but not in GOOD, we obtain the value combinations that may result in defective products. Figure 12 shows the verification of a sample combination ⟨para-xxxx-014=0, para-xxxx-015=0 or 24, para-xxxx-043=44 or 48⟩ (big red crosses indicate that these values appear densely on SCRAP products). Such a parameter-value combination should be avoided in production practice.
Figure 11: Important Parameters Discovered.
Figure 12: A Sample Parameter Combination.
By applying regression analysis in WorkFlow1 of Figure 4, we discovered that environmental parameters, such as temperature and humidity, have significant correlations with the product quality. Further analysis confirmed that when the surrounding temperature of the BR Furnace is below 27°C, the number of defective products with BR Open or BR Short defects increases dramatically. Figure 10 depicts these findings.
The aforementioned findings are some typical examples
obtained from the practical usage of our proposed system.
Most of our findings have been validated by PDP technicians
and are incorporated into their operational manual.
6.3 Deployment Practice
PDP-Miner has been successfully applied to ChangHong COC's PDP production line for the 3rd and 4th generations of products for manufacturing optimization. Every time the product line is upgraded, the yield rate drops significantly, since the previous parameter settings cannot match the new products' requirements. The earlier the parameters are properly tuned, the greater the cost reduction. PDP-Miner has been intensively used in such situations for problem diagnosis, including quickly identifying problematic parameter settings, detecting abnormal parameter values, and monitoring sensitive parameters.
In summary, our system brings several great benefits in
optimizing the production process:
Through establishing the relationship between parameter settings and product quality, manufacturers are more confident in properly controlling the production process based on analytical evidence. The cost has been greatly reduced as the number of defective products decreases.
The prompt analysis of the production data enables
the quick diagnosis on parameter values, especially
when upgrading the assembly line or handling unex-
pected faults. As a result, the throughput increases.
A knowledge database is constructed to manage useful
analytic results that have been verified and validated
by existing domain expertise. Technicians can refer to
the database to look for possible solutions and control
the assembly line more efficiently.
By taking advantage of our system, the overall PDP yield rate has increased from 91% to 94%. Monthly production capacity is boosted by 10,000 panels, which brings more than 117 million RMB of additional revenue per year. Our system plays a revolutionary role and can be naturally transferred to other flat-panel industries, such as Liquid Crystal Display (LCD) panels and Organic Light-Emitting Diode (OLED) panels, to generate great social and economic benefits.
7. Conclusion

PDP-Miner has been deployed as an important supplementary component since 2013. It enables prompt data analysis and efficient knowledge discovery in advanced manufacturing processes. The improved production efficacy shows that a practical data-driven solution that considers both system flexibility and algorithm customization can fill the application gap between manufacturers and data analysts. We firmly believe that, if properly applied, data analytics will become a dominating factor underpinning new waves of productivity growth and innovation, transforming manufacturing across industries in a fundamental manner.
References
[1] R Belz and P Mertens. Combining knowledge-based
systems and simulation to solve rescheduling problems.
Decision Support Systems, 17(2):141–157, 1996.
[2] Injazz J Chen. Planning for erp systems: analysis and
future trend. Business process management journal,
7(5):374–386, 2001.
[3] Wei-Chou Chen, Shian-Shyong Tseng, and Ching-Yao
Wang. A novel manufacturing defect detection method
using association rule mining techniques. Expert
systems with applications, 29(4):807–815, 2005.
[4] Chad A Davis, Fabian Gerick, Volker Hintermair, Caroline C Friedel, Katrin Fundel, Robert Küffner, and Ralf Zimmer. Reliable gene signatures for microarray classification: assessment of stability and performance. Bioinformatics, 22(19):2356–2363, 2006.
[5] Chris Ding and Hanchuan Peng. Minimum
redundancy feature selection from microarray gene
expression data. Journal of bioinformatics and
computational biology, 3(02):185–205, 2005.
[6] Gang Fang, Gaurav Pandey, Wen Wang, Manish
Gupta, Michael Steinbach, and Vipin Kumar. Mining
low-support discriminative patterns from dense and
high-dimensional data. TKDE, 24(2):279–294, 2012.
[7] Christoph Gröger, Florian Niedermann, Holger Schwarz, and Bernhard Mitschang. Supporting manufacturing design by analytics: continuous collaborative process improvement enabled by the advanced manufacturing analytics platform. In CSCWD, pages 793–799. IEEE, 2012.
[8] Christoph Gröger, Florian Niedermann, and Bernhard Mitschang. Data mining-driven manufacturing process optimization. In Proceedings of the World Congress on Engineering, volume 3, pages 4–6, 2012.
[9] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard
Pfahringer, Peter Reutemann, and Ian H Witten. The
weka data mining software: an update. ACM
SIGKDD explorations newsletter, 11(1):10–18, 2009.
[10] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard
Pfahringer, Peter Reutemann, and Ian H. Witten. The
weka data mining software: An update. SIGKDD
Explorations, 2009.
[11] Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
[12] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning, volume 2. Springer, 2009.
[13] Alexandros Kalousis, Julien Prados, and Melanie
Hilario. Stability of feature selection algorithms: a
study on high-dimensional spaces. Knowledge and
information systems, 12(1):95–116, 2007.
[14] Jürgen Kletti. Manufacturing Execution Systems (MES). Springer, 2007.
[15] David Lei, Michael A Hitt, and Joel D Goldhar.
Advanced manufacturing technology: organizational
design and strategic flexibility. Organization Studies,
17(3):501–523, 1996.
[16] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating classification and association rule mining. In SIGKDD, 1998.
[17] Steven Loscalzo, Lei Yu, and Chris Ding. Consensus
group stable feature selection. In SIGKDD, pages
567–576. ACM, 2009.
[18] MILK.
[19] MLC++.
[20] Sewon Oh, Jooyung Han, and Hyunbo Cho. Intelligent
process control system for quality improvement by
data mining in the process industry. In Data mining
for design and manufacturing, pages 289–309.
Springer, 2001.
[21] Sean Owen, Robin Anil, Ted Dunning, and Ellen
Friedman. Mahout in Action. Manning, 2011.
[22] Rajagopal Palaniswamy and Tyler Frank. Enhancing
manufacturing performance with erp systems.
Information systems management, 17(3):43–55, 2000.
[23] Zoltan Prekopcsak, Gabor Makrai, Tamas Henk, and
Csaba Gaspar-Papanek. Radoop: Analyzing big data
with rapidminer and hadoop. In RCOMM, 2011.
[24] Marko Robnik-Šikonja and Igor Kononenko. Theoretical and empirical analysis of ReliefF and RReliefF. Machine Learning, 53(1-2):23–69, 2003.
[25] Lixiang Shen, Francis EH Tay, Liangsheng Qu, and
Yudi Shen. Fault diagnosis using rough sets theory.
Computers in Industry, 43(1):61–72, 2000.
[26] Victor A Skormin, Vladimir I Gorodetski, and
Leonard J Popyack. Data mining technology for failure
prognostic of avionics. TAES, 38(2):388–403, 2002.
[27] Alexander Topchy, Anil K Jain, and William Punch.
A mixture model of clustering ensembles. In SDM,
pages 379–390, 2004.
[28] Toby D Wall, J Martin Corbett, Robin Martin,
Chris W Clegg, and Paul R Jackson. Advanced
manufacturing technology, work design, and
performance: A change study. Journal of Applied
Psychology, 75(6):691, 1990.
[29] Adam Woznica, Phong Nguyen, and Alexandros
Kalousis. Model mining for robust feature selection. In
SIGKDD, pages 913–921. ACM, 2012.
[30] Le Yu, Jian Zheng, Bin Wu, Bai Wang, Chongwei
Shen, Long Qian, and Renbo Zhang. Bc-pdm: Data
mining, social network analysis and text mining
system based on cloud computing. In SIGKDD, 2012.
[31] Lei Yu, Chris Ding, and Steven Loscalzo. Stable
feature selection via dense feature groups. In
SIGKDD, pages 803–811. ACM, 2008.
[32] Chunqiu Zeng, Yexi Jiang, Li Zheng, Jingxuan Li, Lei Li, Hongtai Li, Chao Shen, Wubai Zhou, Tao Li, Bing Duan, Ming Lei, and Pengnian Wang. FIU-Miner: A Fast, Integrated, and User-Friendly System for Data Mining in Distributed Environment. In SIGKDD, 2013.
... It also illustrates that Table 3 Allocations of proposed solutions of reviewed literatures (Based on our literature investigation). MOM/MES [69,116], [66], regression [127], Distance, Regression, Self-organizing map, principal component analysis [128], [129,71], [104,130,131,104,132,112] Classification [133], OPL [106], KM [134], GA [73], O&M [39,135] [ 136,81,82,80] logistic regression, naïve Bayes, and a decision tree [109], regression [137], LSM [138], SVM [83], Anomaly detection [139], DTW [140], RF [141], K-means, Markov [142], KD [24], Prediction 3D printing, product performance, production planning, energy consumption, MES, QC, SCM Random Forest, Bayesian Network, statistic [39,109,63,62,93,136,137,138,141,24,87,143,90,55,161,85,88,25,162, some data formats in the specific manufacturing systems are missing in the proposed bid data solutions. It requires solutions to fill the gap to realize data exchange among the systems. ...
Advanced manufacturing is one of the core national strategies in the US (AMP), Germany (Industry 4.0) and China (Made-in China 2025). The emergence of the concept of Cyber Physical System (CPS) and big data imperatively enable manufacturing to become smarter and more competitive among nations. Many researchers have proposed new solutions with big data enabling tools for manufacturing applications in three directions: product, production and business. Big data has been a fast-changing research area with many new opportunities for applications in manufacturing. This paper presents a systematic literature review of the state-of-the-art of big data in manufacturing. Six key drivers of big data applications in manufacturing have been identified. The key drivers are system integration, data, prediction, sustainability, resource sharing and hardware. Based on the requirements of manufacturing, nine essential components of big data ecosystem are captured. They are data ingestion, storage, computing, analytics, visualization, management, workflow, infrastructure and security. Several research domains are identified that are driven by available capabilities of big data ecosystem. Five future directions of big data applications in manufacturing are presented from modelling and simulation to real-time big data analytics and cybersecurity.
... Heterogeneous Directed Acyclic Graph (DAG) structured large and complex dependencies are increasingly common in data-parallel clusters, and the median DAG in such a cluster can have a depth of five and thousands of tasks [6]. Many previous methods have been proposed to detect the dependency between tasks [7][8][9][10]4]. However, previous scheduling algorithms and preemption algorithms simply schedule precedent tasks prior to their dependent tasks or neglect the dependency. ...
Conference Paper
Full-text available
Task scheduling and preemption are two important functions in data-parallel clusters. Though directed acyclic graph (DAG) task dependencies are common in data-parallel clusters, previous task scheduling and preemption methods do not fully utilize such task dependency to increase throughput since they simply schedule precedent tasks prior to their dependent tasks or neglect the dependency. We notice that in both scheduling and preemption, choosing a task with more dependent tasks to run allows more tasks to be runnable next, which facilitates to select a task that can more increase throughput. Accordingly, in this paper, we propose a Dependency-aware Scheduling and Preemption system (DSP) to achieve high throughput. First, we build an integer linear programming problem to minimize the makespan (i.e., the time when all jobs finish execution) with the consideration of task dependency and deadline, and derive the target server and start time for each task, which can minimize the makespan. Second, we utilize task dependency to determine tasks’ priorities for preemption. Finally, we propose a method to reduce the number of unnecessary preemptions that cause more overhead than the throughput gain. Extensive experimental results based on a real cluster and Amazon EC2 cloud service show that DSP achieves much higher throughput compared to existing strategies.
... A software framework called "PDP-Miner" has been deployed since 2013 for plasma display panel manufactoring described in Zheng et al. (2014). It enables data exploration for expert users, data analysis with an algorithm library and result management, i.e. visualization in graphs. ...
Full-text available
This paper puts a finer point to remote telemaintenance with asynchronous access to an industrial production plant and its data. At first, the condition monitoring records the data from the industrial manipulator and transfers it via the Internet to an application. One possible applications for this data, which can run remotely in order to analyze and to optimize the plant and its processes is a complex sensor view with video integration of one plant cycle (application “FADAT” Facility Asynchronous Data Analysis Tool). Research focus of this paper lies on the user side: The goal of this work is to provide the external expert with a better understanding of the plant and its processes. This is realized for the first time in the examined work environment.
... Given RQ1, about the studies that use data analysis for SPI, the SLR have highlighted the 25 primary studies, which include some interesting approaches that have been applied in a variety of fields for the recent years, specifically in the fields of software engineering [7][8][9], manufacturing [10], and business intelligence for Small and MiddleSize Enterprises (SME) or novice practitioners [11,12]. Regarding to RQ2, from the 25 primary studies, 17 of them are focused in the application of data analysis for software processes. ...
Conference Paper
Current information systems demand high quality software products that guarantee a safety and a reliable use for our day-to-day life. A common understanding between software organizations and practitioners is that software product quality largely depends on the software process quality. A Software Process Improvement (SPI) initiative consists of a set of practices and activities that are designed to improve software organizations processes through the evaluation of their current practices and the way software products and services are developed. However, the big amount of information that is generated from the software organization practices has complicated the knowledge extraction, and therefore, the SPI initiatives. A possible technique to make a good knowledge management is data analysis. This paper presents the results of a systematic literature review to establish the state-of-the-art of data analysis for software process improvement. The findings also encourage to the creation of a BigData-based data analysis model in a future work for this research.
Fueled by increasing data availability and the rise of technological advances for data processing and communication, business analytics is a key driver for smart manufacturing. However, due to the multitude of different local advances as well as its multidisciplinary complexity, both researchers and practitioners struggle to keep track of the progress and acquire new knowledge within the field, as there is a lack of a holistic conceptualization. To address this issue, we performed an extensive structured literature review, yielding 904 relevant hits, to develop a quadripartite taxonomy as well as to derive archetypes of business analytics in smart manufacturing. The taxonomy comprises the following meta-characteristics: application domain, orientation as the objective of the analysis, data origins, and analysis techniques. Collectively, they comprise eight dimensions with a total of 52 distinct characteristics. Using a cluster analysis, we found six archetypes that represent a synthesis of existing knowledge on planning, maintenance (reactive, offline, and online predictive), monitoring, and quality management. A temporal analysis highlights the push beyond predictive approaches and confirms that deep learning already dominates novel applications. Our results constitute an entry point to the field but can also serve as a reference work and a guide with which to assess the adequacy of one's own instruments.
Die rasant zunehmende Digitalisierung von Wirtschaft und Gesellschaft ist die treibende Kraft der Verzahnung von Produktion und modernster Informations- und Kommunikationstechnik. Heutzutage kann zwar auf Echtzeitdaten von Sensoren und Steuerungen zugegriffen werden, allerdings werden diese bisher nicht in standardisierter Form durch Nutzung digitalen Dienstleistungen (Services) und Plattformen genutzt. Nicht zuletzt liegt es an der aufwändigen Entwicklung und Integration der Services in die Dienstleistungsplattform An diesem Punkt setzt das BMBF-Forschungsprojekt ‚MultiCloud-basierte Dienstleistungen für die Produktion‘ an. Ziel ist eine Serviceplattform, die den Entwicklungs- und Integrationsaufwand von neuen Services verringert sowie deren kostengünstigen Betrieb und Nutzung ermöglicht. Dafür arbeiten im Rahmen des Projektes Serviceanbieter, Anwender und Forschungseinrichtungen zusammen.
Kurzfassung Um die Effizienz in der diskreten Produktion zu erhöhen und ungenutzte Potenziale in einer Fabrik zu identifizieren, können mehrwertbringende digitale Services eingesetzt werden. Der vorliegende Beitrag befasst sich mit den Chancen dieser Services sowie mit den Herausforderungen bei deren Etablierung. Vorgestellt wird ein ganzheitliches und durchgängiges Konzept, welches von der Feldebene mit Mechanismen für koordinierte Zeitstempel über eine angepasste Steuerungsarchitektur bis zu beispielhaften Analyse- und Distributionsservices reicht.
Die rasant zunehmende Digitalisierung von Wirtschaft und Gesellschaft ist die treibende Kraft der Verzahnung von Produktion und modernster Informations- und Kommunikationstechnik. Unter dem Begriff Industrie 4.0 verändert sie nachhaltig die Art und Weise, wie in Deutschland zukünftig gearbeitet und produziert wird. Die technischen Grundlagen für die „Smart Factorys“ sind intelligente und vernetze Systeme, mit denen eine selbstorganisierte Produktion möglich sein soll. Um die Produktion noch effizienter zu gestalten, kommunizieren und kooperieren in der Industrie 4.0 Maschinen und Anlagen im selben Produktionsprozess über die Unternehmensgrenzen hinweg miteinander. Damit die Informationen in allen Phasen des Lebenszyklus eines Produkts zur Verfügung stehen, müssen Plattformen geschaffen werden, welche die Entstehung einer solchen Wertschöpfungskette fördern. Eine entsprechende Plattform wird im Projekt MultiCloud realisiert.
Full-text available
Working with large data sets is increasingly common in research and industry. There are some distributed data analytics solutions like Hadoop, that offer high scalability and fault-tolerance, but they usually lack a user interface and only developers can exploit their functionali-ties. In this paper, we present Radoop, an extension for the RapidMiner data mining tool which provides easy-to-use operators for running dis-tributed processes on Hadoop. We describe integration and development details and provide runtime measurements for several data transforma-tion tasks. We conclude that Radoop is an excellent tool for big data analytics and scales well with increasing data set size and the number of nodes in the cluster.
Conference Paper
The advent of the Big Data era drives data analysts from different domains to use data mining techniques for data analysis. However, performing data analysis in a specific domain is not trivial; it often requires complex task configuration, onerous integration of algorithms, and efficient execution in distributed environments. Few efforts have been made to develop effective tools that facilitate data analysts in conducting complex data analysis tasks. In this paper, we design and implement FIU-Miner, a Fast, Integrated, and User-friendly system to ease data analysis. FIU-Miner allows users to rapidly configure a complex data analysis task without writing a single line of code. It also helps users conveniently import and integrate different analysis programs. Further, it significantly balances resource utilization and task execution in heterogeneous environments. A case study of a real-world application demonstrates the efficacy and effectiveness of our proposed system.
A common problem with most feature selection methods is that they often produce feature sets (models) that are not stable with respect to slight variations in the training data. Different authors have tried to improve feature selection stability using ensemble methods, which aggregate different feature sets into a single model. However, existing ensemble feature selection methods suffer from two main shortcomings: (i) the aggregation treats the features independently and does not account for their interactions, and (ii) a single feature set is returned, although in various applications there may be more than one feature set, potentially redundant, with similar information content. In this work we address these two limitations. We present a general framework in which we mine over different feature models produced from a given dataset in order to extract patterns over the models. We use these patterns to derive more complex feature model aggregation strategies that account for feature interactions, and to identify core and distinct feature models. We conduct an extensive experimental evaluation of the proposed framework, demonstrating its effectiveness on a number of high-dimensional problems from the fields of biology and text mining.
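The contrast between independent, frequency-based aggregation and pattern-based aggregation that accounts for feature interactions can be illustrated with a minimal sketch (the feature names, models, and the 50% stability threshold are invented for illustration; the paper's framework mines far richer patterns over the models):

```python
from collections import Counter
from itertools import combinations

# Hypothetical feature sets selected on different resamples of one dataset
feature_models = [
    {"f1", "f2", "f3"},
    {"f1", "f2", "f4"},
    {"f1", "f3", "f4"},
    {"f1", "f2", "f5"},
]

# Independent aggregation: per-feature selection frequency
freq = Counter(f for model in feature_models for f in model)

# Pattern-based view: count pairwise co-occurrences across models,
# capturing feature interactions that plain frequency counting ignores
pair_freq = Counter(
    pair for model in feature_models for pair in combinations(sorted(model), 2)
)

# Keep features selected in at least half of the models
stable_features = {f for f, c in freq.items() if c / len(feature_models) >= 0.5}
print(sorted(stable_features))
print(pair_freq.most_common(3))
```

Here `("f1", "f2")` co-occurs in three of the four models, information that the per-feature counts alone cannot express.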
Ensemble clustering has been one of the research hotspots of data mining in recent years. The selection of high-quality, diverse base clustering results plays a key role in the quality of the final result. Traditional ensemble clustering selection algorithms usually treat each base clustering result as a whole, ignoring the differences between clusters within the same clustering result, which can degrade the validity of the final result. To address this problem, and inspired by the uncertainty measures of rough set theory, a dual-granularity weighted ensemble clustering model is proposed. The main contributions of this paper are as follows: (1) the evaluation of cluster reliability is cast as an uncertainty measurement problem in rough sets; (2) at a finer-grained level, a local sample similarity measure is designed; (3) a method for generating weighted co-association matrix elements based on global cluster reliability and local sample-pair similarity is proposed, and a fusion function is then used to obtain the final clustering result. Experimental results show that the proposed method is not sensitive to the size and diversity of the base clustering members, and thus has good robustness and stability. The final result obtained by this model is closer to the actual distribution of the data sets.
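The co-association idea at the heart of this model can be shown with a minimal, unweighted sketch (the base clusterings and merge threshold are invented for illustration; the paper's model additionally weights each matrix entry by global cluster reliability and local sample-pair similarity):

```python
# Hypothetical base clustering results: each list assigns a label to samples 0..4
base_clusterings = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
]

n = len(base_clusterings[0])
# Co-association matrix: fraction of base clusterings placing samples i and j together
co = [[0.0] * n for _ in range(n)]
for labels in base_clusterings:
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                co[i][j] += 1.0 / len(base_clusterings)

# A simple consensus function: merge samples whose co-association exceeds a threshold
threshold = 0.5
clusters, assigned = [], set()
for i in range(n):
    if i in assigned:
        continue
    cluster = {i} | {j for j in range(n) if co[i][j] > threshold}
    clusters.append(sorted(cluster))
    assigned |= cluster
print(clusters)  # samples {0,1} and {2,3,4} are grouped by majority agreement
```

In the weighted variant, unreliable clusters and dissimilar sample pairs contribute less to `co`, so the consensus step is steered toward the more trustworthy evidence.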
This is the third edition of the premier professional reference on the subject of data mining, expanding and updating the previous market-leading edition. It combines sound theory with truly practical applications to prepare students for real-world challenges in data mining. Like the first and second editions, Data Mining: Concepts and Techniques, 3rd Edition equips professionals with a sound understanding of data mining principles and teaches proven methods for knowledge discovery in large corporate databases; the earlier editions established the book as the market leader for courses in data mining, data analytics, and knowledge discovery. Revisions incorporate input from instructors, changes in the field, and new and important topics such as data warehouse and data cube technology, mining stream data, mining social networks, and mining spatial, multimedia, and other complex data. The book begins with a conceptual introduction followed by comprehensive, state-of-the-art coverage of concepts and techniques. Each chapter is a stand-alone guide to a critical topic, presenting proven algorithms and sound implementations ready to be used directly or with strategic modification against live data. Wherever possible, the authors raise and answer questions of utility, feasibility, optimization, and scalability. The edition offers a comprehensive, practical look at the concepts and techniques needed to get the most out of real business data; more material on statistics and machine learning; scores of algorithms and implementation examples, all in easily understood pseudo-code suitable for use in real-world, large-scale data mining projects; and complete classroom support for instructors, with bonus content available at the companion website.
A comprehensive and practical look at the concepts and techniques you need in the area of data mining and knowledge discovery.
The production plants of today are developing into modern service centers. The economic efficiency of modern value creation is not a property of products alone but of the process; decisive potential in business is now a question of process capability rather than production capability. Process capability requires real-time systems for optimization, so business IT needs to evolve from telecommunications and ERP toward real-time services, which the prevailing ERP systems do not offer. Today, only modern Manufacturing Execution Systems (MES) offer real-time applications. They generate current as well as historic mappings of production facilities and can thus serve as a basis for optimization. It is important to map the supply chain in real time. Increasing complexity in production requires an integrated view of the production and service facilities: detailed scheduling, status collection, quality, performance analysis, tracing of material, and so on have to be recorded and displayed in an integrated way. MESA (Manufacturing Enterprise Solutions Association) has standardized such applications, and further standardization efforts on this subject, such as ISA S95, are already under way. Expectations regarding MES are high, whether for TQM, Six Sigma, production scheduling, or optimized material movements. This book describes the requirements for optimized Manufacturing Execution Systems.
Conference Paper
High competitive pressure in the global manufacturing industry makes efficient, effective, and continuously improved manufacturing processes a critical success factor. Yet existing analytics in manufacturing, e.g., those provided by Manufacturing Execution Systems, suffer from major shortcomings that considerably limit continuous process improvement. In particular, they do not make use of data mining to identify hidden patterns in manufacturing-related data. In this article, we present indication-based and pattern-based manufacturing process optimization as novel data mining approaches provided by the Advanced Manufacturing Analytics Platform. We demonstrate their usefulness through use cases and describe suitable data mining techniques as well as implementation details.
Conference Paper
Large amounts of bulky and noisy shop floor data are characteristic of the process industry. These data must be processed effectively to extract the working knowledge needed to enhance productivity and optimize quality. The objective of the chapter is to present an intelligent process control system integrated with a data mining architecture in order to improve quality. The proposed system is composed of three data mining modules performed on the shop floor in real time: preprocessing, modeling, and knowledge identification. To consider the relationship between multiple process variables and multiple quality variables, the Neural-Network/Partial Least Squares (NNPLS) modeling method is employed. For our case study, the proposed system is configured as three control applications, namely feedback control, feed-forward control, and in-process control, and then applied to the shadow mask manufacturing process. The experimental results show that the system identifies the main causes of quality faults and provides optimized parameter adjustments.
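The feedback-control application described above can be caricatured as a proportional correction of a process setpoint toward a quality target (the setpoint, gain, and quality readings below are hypothetical; the actual system derives its adjustments from the NNPLS model rather than a fixed gain):

```python
def feedback_adjust(setpoint, measured_quality, target_quality, gain=0.5):
    """Return an adjusted process setpoint proportional to the quality error."""
    error = target_quality - measured_quality
    return setpoint + gain * error

# Each cycle measures quality, then nudges the parameter toward the target
setpoint = 100.0
for measured in [90.0, 94.0, 97.0]:
    setpoint = feedback_adjust(setpoint, measured, target_quality=98.0)
    print(round(setpoint, 2))  # setpoint climbs as the quality error shrinks
```

Feed-forward control would instead adjust the setpoint from upstream measurements before the quality fault occurs, using the same model-derived relationship between process and quality variables.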