All content in this area was uploaded by Jean-Luc Bertrand-Krajewski on Sep 30, 2015
Plazas-Nossa et al.
Detection of outliers and replacement of missing values in
absorbance and discharge time series
Leonardo Plazas-Nossa1,3,*, Jean-Luc Bertrand-Krajewski2 and Andrés Torres3
1Universidad Distrital Francisco José de Caldas, Facultad de Ingeniería, Carrera 7 No. 40-53, 110231 Bogotá,
Colombia (Email: lplazasn@udistrital.edu.co)
2University of Lyon, INSA Lyon, LGCIE-DEEP, F-69621 Villeurbanne cedex, France (Email:
jean-luc.bertrand-krajewski@insa-lyon.fr)
3Grupo de Investigación Ciencia e Ingeniería del Agua y el Ambiente, Facultad de Ingeniería, Pontificia Universidad
Javeriana, Carrera 7 No. 40-62, 110231 Bogotá, Colombia (Email: plazas-l@javeriana.edu.co;
andres.torres@javeriana.edu.co).
Abstract
This paper proposes a methodology to detect and remove outliers and to fill gaps in time series. One
notable application is the comparison of dry weather contributions with those of rain events, after
detection and removal of outliers. The methodology combines outlier detection with the Discrete
Fourier Transform (DFT). Together, these tools were applied to a case study of four time series:
three UV-Vis spectra series from study sites in Colombia and one discharge series from France.
Outlier detection gives good results when the window parameter values are small and self-similar,
even though the four time series have different lengths and behaviours. The DFT makes it possible to
complete the time series: it handles gaps of various sizes and replaces both removed outliers and
missing values. The DFT led to low error percentages for all four time series (14 % on average),
which suggests that the completed series approximate the likely behaviour of the time series in the
absence of misleading outliers and missing data.
Keywords
Discrete Fourier Transform, Time series analysis, UV-Vis spectrometry, Water quality, Winsorising
INTRODUCTION
Continuous on-line measurements of discharge and UV-Vis absorbance spectra are increasingly used
for water quality monitoring in sewer systems. They help estimate pollutant concentrations and loads
and can support real-time control. However, such time series may contain outliers or missing values
caused by obstruction, clogging or failure of the sensors themselves, or by periodic sensor
maintenance. Data pre-processing is thus a necessary prerequisite (Liu et al., 2004). This work
proposes a methodology to detect and remove outliers and to fill gaps in time series using
Winsorising, the DFT (Discrete Fourier Transform) and the IFFT (Inverse Fast Fourier Transform)
(Acuña and Rodríguez, 2004; Tukey, 1977). These tools were used to analyse four time series: three
UV-Vis spectra series in Colombia and one discharge series in France.
MATERIALS AND METHODS
The spectro::lyser™ UV-Vis submersible probes deployed at the Colombian sites are equipped with a
xenon lamp and measure light attenuation (absorbance) from 200 nm to 750 nm at 2.5 nm intervals
(Langergraber et al., 2004; s::can, 2006). At the French site, the discharge in a combined
sewer was measured by means of a NIVUS™ OCM Pro Doppler flow meter. The four time series
are: (i) El Salitre WWTP Bogotá-Col., 5705 records (1/min) from June 29th at 9:03 h to July 3rd at
Poster session Data UDM2015
17:33 h, 2011; (ii) Gibraltar Pumping Station (GPS) Bogotá-Col., 35684 records (1/min) from Oct.
18th at 16:17 h to Nov. 11th at 11:20 h, 2011; (iii) San Fernando WWTP Medellín-Col., 107204
records (one every two minutes) from Sep. 24th, 2011 at 11:08 h to Feb. 20th, 2012 at 10:18 h; and
(iv) Ecully urban catchment Lyon-Fr., 19283 records (one every two minutes) from Feb. 27th at
10:38 h to Mar. 26th at 05:22 h, 2007 (Figure 1).
In order to detect and remove outliers, Winsorising transforms the original data using a moving
window of N values and gives more weight to the central value of each window. These central
values depend on two parameters: (i) r, the number of values to be modified within the window, and
(ii) m, the value used to set the minimum and maximum limits of the window range. Values beyond
the upper or lower limit are removed in order to obtain the winsorised data (Ko and Lee, 1991;
Pearson, 2002).
The DFT switches the time series from the time domain to the frequency domain, where it is
represented as a finite combination of complex sinusoid components with the same number of
samples as in the time domain. The DFT is applied to the values in the data set that precede an
interval of missing values. In this work, the ten most important harmonics, which reproduce the
pattern and dynamics of the events, were used (Proakis and Manolakis, 2007; Plazas-Nossa and
Torres, 2013). The IFFT is then used to convert these complex sinusoids (harmonics) into a finite
number of discrete points, returning to the time domain (Plazas-Nossa and Torres, 2013). The
resulting time series is used to complete the interval of missing values; this is called the
“forward step to fill”. However, if half the length of the range of available values is shorter
than the interval of missing values, the procedure tests whether half the length of the subsequent
range of available values is greater than the length of the gap. If so, that range is used to fill
the interval; this is called the “backward step to fill”. If neither the forward nor the backward
step has sufficient length, all values from the start of the time series, including available
values and values replaced in previous ranges, are used to fill the interval of missing values;
this is called the “step using total amount of values to fill”. Finally, the DFT is applied to the
resulting time series and the most important 1% of harmonics is retained.
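The forward step described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, "most important" is taken to mean largest magnitude, the mean (zero-frequency) term is always kept, and the gap is filled by periodic extension of the reconstructed series. The text specifies none of these choices.

```python
import numpy as np

def reconstruct_with_top_harmonics(values, n_harmonics=10):
    """Rebuild a series from its mean plus its largest-magnitude harmonics."""
    values = np.asarray(values, dtype=float)
    spectrum = np.fft.rfft(values)  # time domain -> frequency domain
    # Rank the non-constant components by magnitude
    # (assumed meaning of "most important").
    ranked = np.argsort(np.abs(spectrum[1:]))[::-1] + 1
    keep = ranked[:n_harmonics]
    filtered = np.zeros_like(spectrum)
    filtered[0] = spectrum[0]        # keep the mean level (assumption)
    filtered[keep] = spectrum[keep]  # keep the selected harmonics
    return np.fft.irfft(filtered, n=values.size)  # back to the time domain

def forward_fill_values(history, gap_length, n_harmonics=10):
    """Values for a gap that directly follows `history` (hypothetical helper)."""
    smooth = reconstruct_with_top_harmonics(history, n_harmonics)
    # The DFT treats the record as periodic, so sample N + k repeats sample k.
    return smooth[np.arange(gap_length) % smooth.size]

# Synthetic example: a regular cycle sampled over four whole periods.
t = np.arange(200)
history = 2.0 + np.sin(2 * np.pi * t / 50)
fill = forward_fill_values(history, gap_length=25)
```

A backward step would apply the same functions to the values that follow the gap; how the reconstructed values are aligned with the gap in that case is not stated in the text.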
RESULTS AND DISCUSSION
The Winsorising process was applied using several combinations of the r and m parameters to remove
outliers. The final parameter values, obtained graphically by trial and error, are
summarised in Table 1.
Table 1. r and m parameter values used in time series, removed values and MAPE values
Time series              r     m     Removed values   MAPE (%)
El Salitre WWTP          70    65    2100 – 2439      3
GPS                      150   130   17500 – 19499    9
San Fernando WWTP        170   170   22500 – 27999    25
Ecully urban catchment   110   90    9000 – 10999     21
The four resulting time series are shown in Figure 1. For the Ecully urban catchment, the
Winsorising procedure was applied to remove storm events from the time series, which is useful when
the dry weather contribution to pollutant loads measured during storm events has to be estimated
(Métadier and Bertrand-Krajewski, 2011).
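For readers unfamiliar with Winsorising, the sketch below shows a generic moving-window variant in Python with NumPy. It is not the parameterisation used in this work: the exact roles of r and m are not reproduced, and the `window` and `n_trim` arguments, like the function name, are hypothetical. Each sample is clipped to the range that remains in its window once the `n_trim` smallest and `n_trim` largest values are set aside.

```python
import numpy as np

def rolling_winsorise(x, window=11, n_trim=2):
    """Clip each sample to the trimmed range of its centred moving window."""
    x = np.asarray(x, dtype=float)
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd integer >= 3")
    if not 0 <= n_trim < window // 2 + 1:
        raise ValueError("n_trim must leave at least one value in the window")
    half = window // 2
    # Repeat the edge values so that every sample has a full window.
    padded = np.pad(x, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    ordered = np.sort(windows, axis=1)
    lower = ordered[:, n_trim]               # smallest value kept in each window
    upper = ordered[:, window - 1 - n_trim]  # largest value kept in each window
    return np.clip(x, lower, upper)

# Synthetic example: a flat signal with one isolated spike.
signal = np.ones(11)
signal[5] = 50.0
cleaned = rolling_winsorise(signal, window=5, n_trim=1)
```

With these settings an isolated spike is pulled back to the level of its neighbours, while a smooth ramp passes through unchanged. Longer excursions need a wider window, which may be why large parameter values are needed to remove whole storm events.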
Figure 1. Resulting time series after applying the Winsorising procedure to El Salitre WWTP (a),
GPS (b), San Fernando WWTP (c), and Ecully urban catchment (d)
Figure 2. Resulting time series for all four study sites after applying the proposed procedure to El
Salitre WWTP (a), GPS (b), San Fernando WWTP (c), and Ecully urban catchment (d)
Black and red lines in Figure 2 correspond respectively to the original time series (absorbance and
discharge) and the resulting time series in the four cases: (i) the El Salitre WWTP time series has
three large gaps: the backward filling step was used for the first gap, the forward filling step
for the second gap, and the total amount of useful and replaced values for the third gap; (ii) the
GPS time series has three large gaps, and the forward filling step was used for all of them; (iii)
the San Fernando WWTP time series has ten gaps with 341 missing values and two gaps with 1705
missing values, and the forward filling step was used for all of them; and (iv) the Ecully time
series has no missing values, so the result may represent dry weather conditions for the period
over which the time series was analysed.
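The choice between the three filling steps can be written as a small decision function. The Python sketch below follows the rule as worded in the methods section; the function and argument names are hypothetical, and the handling of the exact boundary cases (greater than or equal to for the forward step, strictly greater for the backward step) is an assumption based on that wording.

```python
def choose_fill_step(n_before, n_after, gap_length):
    """Select the filling step for one gap.

    n_before   -- number of available values in the range preceding the gap
    n_after    -- number of available values in the range following the gap
    gap_length -- number of missing values in the gap
    """
    # Forward step: half of the preceding range must cover the gap.
    if n_before / 2 >= gap_length:
        return "forward"
    # Backward step: half of the subsequent range must exceed the gap.
    if n_after / 2 > gap_length:
        return "backward"
    # Otherwise use all values from the start of the series.
    return "total"

# Illustrative lengths only, not taken from the study data.
steps = [
    choose_fill_step(n_before=4000, n_after=500, gap_length=341),
    choose_fill_step(n_before=300, n_after=2000, gap_length=341),
    choose_fill_step(n_before=300, n_after=300, gap_length=341),
]
```

In this example the three calls select the forward, backward and total steps respectively, the same sequence of cases as described for the three gaps of the El Salitre WWTP series, though in a different order.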
To assess the gap-filling performance, continuous subsets of the discharge and absorbance time
series with no outliers and no missing values were intentionally removed from the original time
series and then filled with the proposed procedure. The mean absolute percentage error (MAPE) was
used as performance indicator. Black and red lines in Figure 3 correspond respectively to the
original time series (absorbance and discharge) and the resulting time series. Overall, a 14 %
average error was obtained (see Table 1 for more detailed results).
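The MAPE is not written out in the text. The sketch below uses its usual definition, 100 times the mean of |observed − estimated| / |observed|, applied to the values that were removed on purpose. The function and variable names are hypothetical, and the zero check is an added safeguard, since the ratio is undefined when an observed value is zero.

```python
import numpy as np

def mape(observed, estimated):
    """Mean absolute percentage error, in percent."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    if observed.shape != estimated.shape:
        raise ValueError("observed and estimated must have the same shape")
    if np.any(observed == 0):
        raise ValueError("MAPE is undefined when an observed value is zero")
    return 100.0 * np.mean(np.abs((observed - estimated) / observed))

# Illustrative numbers only, not data from the study.
example = mape([100.0, 200.0, 400.0], [110.0, 180.0, 400.0])
```

Because each error is divided by the observed value, the indicator is sensitive to periods where the observed signal is close to zero.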
Figure 3. Resulting time series for all four study sites after removing a selected range of values
and applying the proposed procedure to El Salitre WWTP (a), GPS (b), San Fernando WWTP (c), and
Ecully urban catchment (d)
CONCLUSIONS
The method was applied to four time series with different characteristics. It uses Winsorising for
outlier removal, and the DFT and the IFFT, with the ten most important harmonics of the useful
values, to fill gaps of missing values through forward, backward and total-amount-of-values filling
steps. To obtain the final pre-processed time series, the DFT and the IFFT were then applied to the
resulting time series, retaining the most important 1% of harmonics. The resulting r and m
parameter values were specific to each time series because each time series behaves differently.
For the Ecully urban catchment, the Winsorising process was applied to remove storm events from the
time series and obtain its dry weather behaviour. The mean absolute percentage errors obtained were
between 2.9% and 25%, with an overall average of 14 %. The proposed process could be applied to
time series from different sites and with different characteristics, because it preserves the
behaviour pattern of the time series preceding each gap of missing values.
REFERENCES
Acuña, E. and Rodríguez, C. (2004). On Detection of Outliers and Their Effect in Supervised Classification. Department
of Mathematics University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico [online]
http://academic.uprm.edu/~eacuna/vene31.pdf (Accessed October 15, 2013).
Ko, S-J and Lee, Y. (1991). Theoretical analysis of winsorizing smoothers and their applications to image processing.
Acoustics, Speech, and Signal Processing, ICASSP-1991, 4, 3001–3004.
Langergraber, G., Fleischmann, N., Hofstaedter, F. and Weingartner A. (2004). Monitoring of a paper mill waste water
treatment plant using UV/VIS spectroscopy, IWA Water Science and Technology 49(1), 9–14.
Liu, H., Shah, S. and Jiang, W. (2004). On-line outlier detection and data cleaning. Computers and Chemical Engineering
28, 1635–1647.
Métadier, M. and Bertrand-Krajewski, J-L. (2011). Assessing dry weather flow contribution in TSS and COD storm
events loads in combined sewer systems. Water Science and Technology 63(12), 2983–2991.
Pearson, R. (2002). Outliers in process modelling and identification. IEEE Transactions on Control Systems Technology
10, 55–63.
Plazas-Nossa, L. and Torres, A. (2013). Fourier analysis as a forecasting tool for absorbance time series received by
UV-Vis probes installed on urban sewer systems. Proceedings of 8th International Conference Novatech 2013, Lyon,
France, 23-27 June 2013.
Proakis, J. and Manolakis, D. (2007). Digital signal processing: principles, algorithms, and applications, 4th ed. New
Jersey: Pearson Prentice Hall.
s::can (2006). Manual ana::pro Version 5.3, September 2006 Release. s::can Messtechnik GmbH, Vienna, Austria.
Tukey, J. (1977). Exploratory data analysis. Addison-Wesley, ISBN 0-201-07616-0.