Digital Proceedings of the ICOEST’2013, Cappadocia
C.Ozdemir, S. Şahinkaya, E. Kalıpcı, M.K. Oden (editors)
Nevsehir, Turkey, June 18 – 21, 2013
Wastewater Effluent Prediction Based on Decision Tree
U.Çelik*, Assist.Prof. N.Yurtay, C. Sertkaya
Computer Engineering Department, Faculty of Computer and Information Sciences, Institute of Arts and Sciences, Sakarya
University, TURKEY.
Esentepe Campus 54187, Sakarya, Turkey
(E-mails: ufuk.celik1@ogr.sakarya.edu.tr, nyurtay@sakarya.edu.tr, d085012051@sakarya.edu.tr)
ABSTRACT
Wastewater treatment systems speed up the natural cleansing process to achieve the desired treatment objectives. Predicting
the characteristics of the treated wastewater helps in adjusting the existing process steps and is important for achieving
maximum process efficiency. In this study, a computer-aided decision tree based on the Gini algorithm is developed for
estimating important output parameters of wastewater such as pH, DBO, DQO, and SS. The dataset used in this study was
obtained from the University of California Irvine (UCI) Machine Learning Repository.
Keywords: wastewater, decision tree, classification.
1. INTRODUCTION
Wastewater treatment plants (WWTP) need to be controlled properly to meet the required
discharge standards. Mismanagement of a WWTP causes significant problems for water,
the environment and human health. Several complex processes take place in a WWTP. These
processes are affected by many physical, biological, and chemical factors. The best treatment
performance can be achieved by monitoring these continuous process parameters and
controlling these factors. However, wastewater parameters behave non-linearly, and this
nonlinearity forces the use of complex mathematical functions to estimate them.
Predicting any of these parameters helps the operator control the system and take the
necessary precautions before a problem arises, so that the required treatment performance
can be achieved.
Many recent studies have applied intelligent methods to wastewater treatment. This
research concerns the prediction of WWTP parameters and the process control of
WWTPs. Simple prediction models based on neural networks were developed for Total
Suspended Solids (TSS) prediction in [1] and [2]. An NW multi-layer forward artificial neural
network soft-sensing model was proposed for predicting the performance of and controlling
WWTP processes in [3]. An integrated neural-fuzzy process controller was developed to
control aeration in an Aerated Submerged Biofilm Wastewater Treatment Process in [4].
Self-organizing networks were designed for one-step-ahead prediction of the outputs of the
pre-precipitation stage of a wastewater treatment plant in [5]. An approach based on an agent
with learning capabilities was proposed for the N-Ammonia removal process in [6].
A common characteristic of the proposed models is the use of more than one input parameter to
estimate a single output parameter. The success rate of such multiple-input, single-output
models is higher than that of the others.
In this study, predictive decision tree models based on the Gini algorithm are presented
for the estimation of the effluent parameters pH, DBO, DQO, and SS. According to the test
results, the performance of the developed models is at an acceptable level, with prediction
accuracies between 57.81% and 81.25%.
2. METHOD
Data collection and analysis are essential for the successful development of a model. The
first step in the development process is collecting the data and storing it on a computer
in a regular format. The next step is data cleansing, the process of detecting and
correcting (or removing) corrupt or inaccurate records in a dataset [7]. After this step,
Principal Component Analysis (PCA) is performed to find the best input parameters for each
output parameter. The proposed decision tree models are then developed and simulated, and
the obtained results are discussed. These model development steps are shown schematically
in Figure 1.
[Flowchart: Data Collection, Data Cleansing, PCA Analysis, Decision Tree Model Development, Model Simulation and Discussion]
Figure 1. The steps of model development
2.1 Data Collection and Analysis
The data were obtained from the University of California Irvine (UCI) Machine Learning
Repository. The dataset was collected from the daily sensor measurements of an urban
wastewater treatment plant and contains 527 daily records [8]. Twelve features (9 inputs
and 3 outputs) were selected. The UCI dataset features used are shown in Table 1.
Table 1. Dataset attributes
No   Parameter   Description
1    Q-E         input flow to plant
2    ZN-E        input Zinc to plant
3    PH-E        input pH to plant
4    DBO-E       input Biological demand of oxygen to plant
5    DQO-E       input chemical demand of oxygen to plant
6    SS-E        input suspended solids to plant
7    SSV-E       input volatile suspended solids to plant
8    SED-E       input sediments to plant
9    COND-E      input conductivity to plant
10   DBO-S       output Biological demand of oxygen
11   DQO-S       output chemical demand of oxygen
12   SS-S        output suspended solids
On examination, some records in the dataset were found to contain missing or inaccurate
values. These records were removed from the dataset, and the study continued with the
remaining 323 records. The range of the output parameters is shown in Figure 2.
Figure 2. Range of the output parameters (DBO-S, DQO-S and SS-S plotted against days)
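The record removal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say how missing values are encoded, so the "?" marker, the helper name drop_incomplete and the small sample below are all hypothetical and not taken from the study.

```python
import csv
import io

def drop_incomplete(rows, columns):
    """Keep only the rows in which every selected column parses as a number."""
    clean = []
    for row in rows:
        try:
            clean.append({c: float(row[c]) for c in columns})
        except (KeyError, TypeError, ValueError):
            # A "?" marker, other non-numeric text or an absent column
            # marks the record as unusable, so it is skipped.
            continue
    return clean

# Tiny made-up sample (not real plant data); "?" is an assumed missing marker.
sample = io.StringIO(
    "Q-E,PH-E,DBO-S\n"
    "44101,7.8,20\n"
    "39024,?,17\n"
    "32229,7.6,?\n"
    "35023,7.9,13\n"
)
rows = list(csv.DictReader(sample))
clean = drop_incomplete(rows, ["Q-E", "PH-E", "DBO-S"])
print(len(rows), "->", len(clean))
```

Dropping whole records is the simplest policy and matches what the text describes; imputing the missing values would keep more data but is not what was done here.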
After the data cleansing process, principal component analysis (PCA) was performed
for each output parameter. PCA is known as a variable reduction procedure [9]. The
goal is to select the input parameters that correlate highly with the output parameter. This
frees the system from unnecessary input parameters and improves its performance
[10, 11].
In this study, the linear correlation method is used for the PCA [12]. The correlation
equation is as follows:
$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
where E is the expected value operator, and X and Y are two random variables with expected
values $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$.
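As an illustration, the linear correlation between one input column and one output column can be computed as follows. The sketch assumes that the coefficient meant here is the Pearson product-moment correlation, as the reference to [12] suggests; the function name pearson and the sample values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and standard deviations, all with the same 1/n factor,
    # so the factor cancels in the final ratio.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (std_x * std_y)

# Made-up values: a perfectly increasing and a perfectly decreasing relation.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))
```

Applying such a function to every input and output column pair of the cleansed dataset would yield a table of coefficients between -1 and 1.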
After the correlation process, the relationships between the input and output parameters were
found as given in Table 2.
Table 2. Correlation results
Input     PH-S     DBO-S    DQO-S    SS-S
Q-E        0.07     0.01    -0.06    -0.01
ZN-E      -0.14    -0.02     0.03    -0.04
PH-E       0.35     0.01     0.01    -0.06
DBO-E      0.00     0.15     0.28     0.15
DQO-E     -0.01     0.09     0.30     0.07
SS-E       0.10     0.02    -0.01     0.02
SSV-E     -0.11    -0.01     0.14    -0.02
SED-E      0.02     0.03     0.06     0.00
COND-E     0.10     0.02     0.17    -0.01
The correlation results summarize the relation between the input and output parameters:
The output parameter PH-S has a negative correlation with the input parameters ZN-E,
DQO-E and SSV-E, and a non-negative correlation with the others.
The output parameter DBO-S has a negative correlation with the input parameters ZN-E and
SSV-E, and a positive correlation with the others.
The output parameter DQO-S has a negative correlation with the input parameters Q-E and
SS-E, and a positive correlation with the others.
The output parameter SS-S has a negative correlation with the input parameters Q-E, ZN-E,
PH-E, SSV-E and COND-E, and a non-negative correlation with the others.
Parameters with a correlation above 0.05 were chosen for use in the models. Based on
the obtained results, the decision tree models for each output parameter were defined as
shown in Figure 3.
Figure 3. Developed decision tree models
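The input selection rule can be illustrated with the coefficients of Table 2. The sketch below is not the authors' procedure: it reads "above 0.05" literally as a signed comparison by default, and the use_abs switch (a hypothetical option) shows how the selection changes if the magnitude of the correlation is meant instead.

```python
INPUTS = ["Q-E", "ZN-E", "PH-E", "DBO-E", "DQO-E", "SS-E", "SSV-E", "SED-E", "COND-E"]

# Coefficients copied from Table 2, one list per output, ordered as INPUTS.
CORRELATIONS = {
    "PH-S":  [0.07, -0.14, 0.35, 0.00, -0.01, 0.10, -0.11, 0.02, 0.10],
    "DBO-S": [0.01, -0.02, 0.01, 0.15, 0.09, 0.02, -0.01, 0.03, 0.02],
    "DQO-S": [-0.06, 0.03, 0.01, 0.28, 0.30, -0.01, 0.14, 0.06, 0.17],
    "SS-S":  [-0.01, -0.04, -0.06, 0.15, 0.07, 0.02, -0.02, 0.00, -0.01],
}

def select_inputs(output, threshold=0.05, use_abs=False):
    """Return the inputs whose correlation with `output` exceeds the threshold."""
    selected = []
    for name, value in zip(INPUTS, CORRELATIONS[output]):
        score = abs(value) if use_abs else value
        if score > threshold:
            selected.append(name)
    return selected

for output in CORRELATIONS:
    print(output, select_inputs(output), select_inputs(output, use_abs=True))
```

For DBO-S both readings select DBO-E and DQO-E, while for PH-S the magnitude-based reading additionally selects ZN-E and SSV-E.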
2.2 Gini Model
In this study, the decision tree was obtained with the Gini algorithm using the RapidMiner
software. The Gini algorithm is based on binary splitting of the dataset. For each
attribute, the attribute values are divided into two groups, and each group forms one
branch of the split (left and right). The number of samples of each class that fall into
each branch is then used in the Gini formulation [13]. The Gini formulation first
computes a value for the left and for the right branch. For each attribute, the Gini_left
and Gini_right expressions of the left and right divisions use the following definitions:
L_i : number of samples of class i in the left branch
R_i : number of samples of class i in the right branch
k : number of classes
T : samples at the node
|T_left| : number of samples in the left branch
|T_right| : number of samples in the right branch
With these definitions, Gini_left and Gini_right are calculated with the following relations.
$$Gini_{left} = 1 - \sum_{i=1}^{k} \left( \frac{L_i}{|T_{left}|} \right)^2$$
$$Gini_{right} = 1 - \sum_{i=1}^{k} \left( \frac{R_i}{|T_{right}|} \right)^2$$
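For each attribute j, with n denoting the number of samples in the learning set, the
following relation is calculated [10]:
$$Gini_j = \frac{1}{n} \left( |T_{left}| \, Gini_{left} + |T_{right}| \, Gini_{right} \right)$$
The split with the smallest Gini_j value is the one selected.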
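The Gini quantities of this section can be illustrated with a short Python sketch. It is a generic illustration of a weighted Gini split score, not the RapidMiner implementation; the function names and the class counts are hypothetical, and the weighting uses the number of samples reaching the node, which equals the size of the learning set only at the root.

```python
def gini(class_counts):
    """Gini impurity of one branch, given the sample count of each class."""
    total = sum(class_counts)
    if total == 0:
        return 0.0  # convention chosen here for an empty branch
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def gini_split(left_counts, right_counts):
    """Size-weighted average of the left and right branch impurities."""
    n_left = sum(left_counts)
    n_right = sum(right_counts)
    n = n_left + n_right
    if n == 0:
        raise ValueError("cannot score a split with no samples")
    return (n_left * gini(left_counts) + n_right * gini(right_counts)) / n

# Made-up two-class example: the left branch is pure, the right one is mixed.
left = [4, 0]   # L_i: samples of class 1 and class 2 in the left branch
right = [1, 3]  # R_i: samples of class 1 and class 2 in the right branch
print(gini(left), gini(right), gini_split(left, right))
```

A tree builder would evaluate this score for every candidate split of every attribute and keep the split with the lowest value.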
2.3 Experimental Results
Figure 4. Decision Tree for PH analysis
Table 3. ROC analysis for PH results
accuracy: 64.62%
                  true (7.3-7.5)   true (7.5-7.7)   true (7.7-7.9)   true (>7.9)   class precision
pred. (7.3-7.5)   0                0                0                0             0.00%
pred. (7.5-7.7)   2                10               4                4             50.00%
pred. (7.7-7.9)   1                5                32               7             71.11%
pred. (>7.9)      0                0                0                0             0.00%
class recall      0.00%            66.67%           88.89%           0.00%
For the decision tree analysis, the PH output values were divided into classes over the
range 7.3 to 7.9 in steps of 0.2. The overall prediction accuracy is 64.62%.
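The accuracy, class precision and class recall figures reported for PH can be recomputed from the counts in Table 3. The sketch below is only a check of the arithmetic; the function names are hypothetical, and rows are taken as predicted classes and columns as true classes, following the layout of the table.

```python
LABELS = ["7.3-7.5", "7.5-7.7", "7.7-7.9", ">7.9"]

# Counts from Table 3: MATRIX[p][t] = samples predicted as class p whose true class is t.
MATRIX = [
    [0, 0, 0, 0],
    [2, 10, 4, 4],
    [1, 5, 32, 7],
    [0, 0, 0, 0],
]

def accuracy(matrix):
    """Share of samples on the diagonal, i.e. predicted class equals true class."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

def precision(matrix, i):
    """Correct predictions of class i divided by all predictions of class i."""
    predicted = sum(matrix[i])
    return matrix[i][i] / predicted if predicted else 0.0

def recall(matrix, i):
    """Correct predictions of class i divided by all true members of class i."""
    actual = sum(row[i] for row in matrix)
    return matrix[i][i] / actual if actual else 0.0

print(f"accuracy: {100 * accuracy(MATRIX):.2f}%")
for i, label in enumerate(LABELS):
    print(label, f"{100 * precision(MATRIX, i):.2f}%", f"{100 * recall(MATRIX, i):.2f}%")
```

The two classes that are never predicted (7.3-7.5 and >7.9) get 0.00% precision and recall, which shows that the tree effectively separates only the two middle classes.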
Figure 5. Decision Tree for DBO analysis
Table 4. ROC analysis for DBO results
accuracy: 81.25%
                true (0-15)   true (15-30)   true (30-45)   true (>45)   class precision
pred. (15-30)   8             41             2              0            80.39%
pred. (0-15)    11            2              0              0            84.62%
pred. (30-45)   0             0              0              0            0.00%
pred. (>45)     0             0              0              0            0.00%
class recall    57.89%        95.35%         0.00%          0.00%
For the decision tree analysis, the DBO output values were divided into classes over the
range 0 to 45 in steps of 15. The overall prediction accuracy is 81.25%.
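Turning a continuous output such as DBO-S into the classes used above can be sketched as follows. The paper does not state how values exactly on a class boundary are assigned, so the choice made here (a boundary value goes to the lower class, and values below the range go to the first class) is an assumption, as is the function name discretize.

```python
import bisect

def discretize(value, lower, edges):
    """Map a numeric value to a class label such as '15-30' or '>45'.

    `edges` is the ascending list of upper class bounds, e.g. [15, 30, 45].
    Assumption: a value equal to an edge is placed in the lower class.
    """
    starts = [lower] + edges[:-1]
    labels = [f"{lo}-{hi}" for lo, hi in zip(starts, edges)]
    labels.append(f">{edges[-1]}")
    # bisect_left counts how many edges are strictly below the value,
    # which is exactly the index of the class the value falls into.
    return labels[bisect.bisect_left(edges, value)]

# Made-up measurements, binned with the class limits used for DBO and PH.
print([discretize(v, 0, [15, 30, 45]) for v in (9, 15, 22, 50)])
print(discretize(7.6, 7.3, [7.5, 7.7, 7.9]))
```

The same helper covers the DQO classes (edges 50, 75, 100) by changing only the edge list.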
Figure 6. Decision Tree for DQO analysis
Table 5. ROC analysis for DQO results
accuracy: 57.81%
                 true (0-50)   true (50-75)   true (75-100)   true (>100)   class precision
pred. (0-50)     2             1              1               0             50.00%
pred. (50-75)    3             16             6               1             61.54%
pred. (75-100)   0             2              16              11            55.17%
pred. (>100)     0             0              2               3             60.00%
class recall     40.00%        84.21%         64.00%          20.00%
For the decision tree analysis, the DQO output values were divided into classes over the
range 0 to 100, with class limits at 50, 75 and 100. The overall prediction accuracy is 57.81%.
Figure 7. Decision Tree for SS analysis
Table 6. ROC analysis for SS results
accuracy: 65.62%
                true (0-10)   true (11-20)   true (21-30)   true (>30)   class precision
pred. (0-10)    1             0              1              0            50.00%
pred. (11-20)   1             35             12             6            64.81%
pred. (21-30)   0             2              6              0            75.00%
pred. (>30)     0             0              0              0            0.00%
class recall    50.00%        94.59%         31.58%         0.00%
For the decision tree analysis, the SS output values were divided into classes over the
range 0 to 30 in steps of 10. The overall prediction accuracy is 65.62%.
3. CONCLUSIONS
Developing a diagnosis tool for controlling a wastewater treatment plant is an
interesting contribution to the field of intelligent systems applied to industrial processes.
It can be especially helpful to the plant manager.
The effluent quality of every water treatment plant is expected to fulfill the
wastewater quality standard specified in government regulations. It is therefore important
to predict the plant output values and to take preventive measures and make improvements
for those outputs. For this reason, the use of expert systems for wastewater treatment plant
monitoring, control, diagnosis and assessment is a point of interest for researchers in the domain.
The decision tree accuracy results are moderate rather than perfect for a water treatment
plant. The decision tree method used in this study is most useful when the data can be
grouped into classes, and it needs a dataset with well-represented classes in
order to give better results. Although wastewater plant datasets are not well suited to
decision tree algorithms, they can be adapted by dividing the output values into groups,
at the cost of some accuracy. This problem can be addressed by combining the decision
tree with other algorithms into a hybrid algorithm, so that the accuracy can be increased.
Future work will consist of adding fuzzy logic, neural networks or an artificial immune
system algorithm to the developed expert system in order to obtain a better expert system for
wastewater treatment plant effluent assessment. Furthermore, the developed system can be
extended with a control component for wastewater treatment plant effluent quality control.
REFERENCES
[1] Belanche, L., Valdés, J. J., Comas, J., Roda, I. R., & Poch, M. (2000). Prediction of the
bulking phenomenon in wastewater treatment plants. Artificial Intelligence in
Engineering, 14(4), 307–317.
[2] Hanbay, D., Turkoglu, I., & Demir, Y. (2008). Prediction of wastewater treatment plant
performance based on wavelet packet decomposition and neural networks. Expert
Systems with Applications, 43(2), 1038–1043.
[3] Zhang, R., & Hu, X. (2012). Effluent quality prediction of wastewater treatment system based on
small-world ANN. Journal of Computers, 7(9).
[4] Mingzhi, H., Jinquan, W., Yongwen, M., Yan, W., Weijiang, L., & Xiaofei, S. (2009). Control rules
of aeration in a submerged biofilm wastewater treatment process using fuzzy neural
networks. Expert Systems with Applications, 36(7).
[5] Nilsson, S., Stathaki, A., & King, R. E. (2005). Prediction of wastewater pre-precipitation variables
using self-organizing networks. IEEE International Symposium 2005, pp. 932–937.
[6] Olmo, F. H., Llanes, F. H., & Gaudioso, E. (2012). An emergent approach for the control of
wastewater treatment plants by means of reinforcement learning techniques. Expert
Systems with Applications, 39(3), 2355–2360.
[7] Prasad, K. H., Faruquie, T. A., Joshi, S., Chaturvedi, S., Subramaniam, L. V., & Mohania, M.
(2011). Data cleansing techniques for large enterprise datasets. SRII Global
Conference (SRII), pp. 135–144.
[8] Machine Learning Repository (UCI) (2013). Available at
http://archive.ics.uci.edu/ml/datasets/Water+Treatment+Plant
[9] Smith, L. (2002). A tutorial on Principal Components Analysis. Available at
http://www.sccg.sk/~haladova/principal_components.pdf
[10] Oliveira-Esquerre, K. P., Mori, M., & Bruns, R. E. (2002). Simulation of an industrial
wastewater treatment using artificial neural networks and principal components
analysis. Brazilian Journal
[11] Civelekoglu, G. (2006). The Modeling of Treatment Processes with Artificial
Intelligence and Multistatistical Methods. Doctorate Thesis, Suleyman Demirel
University, Turkey.
[12] Wikipedia_2 (2013). Correlation and dependence. Available at
http://en.wikipedia.org/wiki/Correlation_and_dependence
[13] Özkan, Y. (2008). Data Mining Methods. Papatya Publishing, Turkey.