Conference Paper

Detection of outliers and replacement of missing values in absorbance and discharge time series

Plazas-Nossa et al.
Leonardo Plazas-Nossa1,3,*, Jean-Luc Bertrand-Krajewski2 and Andrés Torres3
1Universidad Distrital Francisco José de Caldas, Facultad de Ingeniería, Carrera 7 No. 40-53, 110231 Bogotá,
Colombia (Email: lplazasn@udistrital.edu.co)
2University of Lyon, INSA Lyon, LGCIE-DEEP, F-69621 Villeurbanne cedex, France
(Email: jean-luc.bertrand-krajewski@insa-lyon.fr)
3Grupo de Investigación Ciencia e Ingeniería del Agua y el Ambiente, Facultad de Ingeniería, Pontificia Universidad
Javeriana, Carrera 7 No. 40-62, 110231 Bogotá, Colombia (Email: plazas-l@javeriana.edu.co;
andres.torres@javeriana.edu.co).
Abstract
The present paper proposes a methodology aiming to detect and remove outliers as well as to fill
gaps in time series. One notable application of such a methodology would be the comparison of dry
weather contributions to those of rain events, after detection and removal of outliers. The proposed
methodology includes outlier detection and application of the Discrete Fourier Transform (DFT).
Together, these tools were used to analyse a case study including four time series: three UV-Vis
spectra series from study sites in Colombia and one discharge series in France. Outlier detection
with the proposed methodology gives good results when window parameter values are small and
self-similar, despite the fact that the four time series exhibited different lengths and behaviours. The
DFT allows completing time series based on its ability to manage various gap sizes, remove outliers
and replace missing values. The DFT led to low error percentages for all four time series (14 % on
average). This percentage reflects what the time series behaviour would likely have been in the
absence of misleading outliers and missing data.
Keywords
Discrete Fourier Transform, Time series analysis, UV-Vis spectrometry, Water quality, Winsorising
INTRODUCTION
Continuous on-line measurements of discharges and UV-Vis absorbance spectra are increasingly
applied for water quality monitoring in sewer systems. They help estimate pollutant
concentrations and loads in sewer systems and can support real-time control. However, time series
may contain outliers or missing values due to obstruction, clogging and failure of the sensors themselves,
or due to periodic sensor maintenance. Data pre-processing is thus a necessary prerequisite
(Liu et al., 2004). This work proposes a methodology aiming to detect and remove outliers as well
as to fill gaps in time series using Winsorising, the DFT (Discrete Fourier Transform) and the IFFT
(Inverse Fast Fourier Transform) (Acuña and Rodríguez, 2004; Tukey, 1977). These tools were used
to analyse four time series: three UV-Vis spectra series in Colombia and one discharge series in France.
MATERIALS AND METHODS
The spectro::lyser™ UV-Vis submersible probes deployed at the Colombian sites are equipped with a
xenon lamp and measure light attenuation (absorbance) at wavelengths from 200 nm to 750 nm at
2.5 nm intervals (Langergraber et al., 2004; s::can, 2006). At the French site, the discharge in a combined
sewer was measured by means of a NIVUS™ OCM Pro Doppler flow meter. The four time series
are: (i) El Salitre WWTP Bogotá-Col., 5705 records (1/min) from June 29th at 9:03 h to July 3rd at
Poster session Data UDM2015
17:33 h, 2011; (ii) Gibraltar Pumping Station (GPS) Bogotá-Col., 35684 records (1/min) from Oct.
18th at 16:17 h to Nov. 11th at 11:20 h, 2011; (iii) San Fernando WWTP Medellín-Col., 107204
records (one every 2 min) from Sep. 24th, 2011 at 11:08 h to Feb. 20th, 2012 at 10:18 h; and (iv)
Ecully urban catchment Lyon-Fr., 19283 records (one every 2 min) from Feb. 27th at 10:38 h to Mar.
26th at 05:22 h, 2007 (Figure 1).
In order to detect and remove outliers, Winsorising transforms the original data and gives more
weight to the central value of each time window by using a moving window of N values. Such central
values depend on two parameters: (i) r, the number of values to be modified within the window, and
(ii) m, the value used to set the minimum and maximum values of the window range, which allows
values beyond the upper or lower limits to be removed in order to obtain the winsorised data (Ko and
Lee, 1991; Pearson, 2002). The DFT switches the time series from the time domain to the
frequency domain and represents it as a finite combination of complex sinusoid components, with the
same number of sample values as in the time domain. The DFT is applied to the values in
the data set that precede an interval of missing values. In this work, the ten most important harmonics,
which reproduce the pattern and dynamics of the events, were used (Proakis and Manolakis, 2007;
Plazas-Nossa and Torres, 2013). The IFFT is then used to convert the complex sinusoids
(harmonics) into a finite number of discrete points, returning to the time domain (Plazas-Nossa and
Torres, 2013). The resulting time series is used to complete the interval of missing values; this is called
the “forward step to fill”. However, if half the length of the range of available values is lower than the length
of the interval of missing values, the procedure tests whether half the length of the subsequent range of
available values is greater than the length of the gap; if so, that range is used to fill the interval, which is called
the “backward step to fill”. If neither the forward nor the backward step has sufficient length,
all the values from the start of the time series, including available values and values replaced in previous
ranges, are used to fill the interval of missing values; this is called the “step using total amount of values
to fill”. Finally, the DFT is applied to the resulting time series and the most important 1 % of
harmonics are used.
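The harmonic selection and forward filling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (dft, top_harmonics, evaluate, forward_fill) are hypothetical, "most important" is assumed to mean largest magnitude, and the gap is assumed to be filled by periodic extension of the reduced Fourier series, a detail the text does not specify. A direct O(N^2) DFT from the Python standard library keeps the sketch self-contained; an FFT routine would normally be preferred for series of the lengths reported here.

```python
import cmath
import math


def dft(x):
    """Direct O(N^2) discrete Fourier transform of a real sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def top_harmonics(spectrum, n_harmonics):
    """Zero every coefficient except the mean and the strongest harmonics.

    Assumption: "most important" means largest magnitude. Harmonic k and its
    mirror N-k are kept together so that the inverse transform stays real.
    """
    n = len(spectrum)
    ranked = sorted(range(1, n // 2 + 1),
                    key=lambda k: abs(spectrum[k]), reverse=True)
    keep = {0}
    for k in ranked[:n_harmonics]:
        keep.update({k, (n - k) % n})
    return [c if k in keep else 0j for k, c in enumerate(spectrum)]


def evaluate(spectrum, t):
    """Inverse DFT at integer time t; t >= N gives the periodic extension."""
    n = len(spectrum)
    total = sum(c * cmath.exp(2j * math.pi * k * t / n)
                for k, c in enumerate(spectrum))
    return total.real / n


def forward_fill(available, gap_length, n_harmonics=10):
    """Estimate gap_length values following `available` from its main harmonics."""
    reduced = top_harmonics(dft(available), n_harmonics)
    n = len(available)
    return [evaluate(reduced, n + i) for i in range(gap_length)]
```

For a purely periodic signal whose period divides the length of the available range, a single retained harmonic is enough for this sketch to reproduce the continuation of the signal; real absorbance or discharge series would only be approximated.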
RESULTS AND DISCUSSION
The Winsorising process was applied using several combinations of the r and m parameters to remove
outliers. The final parameter values (r and m), obtained graphically by trial and error, are
summarised in Table 1.
Table 1. r and m parameter values used for each time series, removed values and MAPE values
Time series               r     m     Removed values   MAPE (%)
El Salitre WWTP           70    65    2100–2439        3
GPS                       150   130   17500–19499      9
San Fernando WWTP         170   170   22500–27999      25
Ecully urban catchment    110   90    9000–10999       21
The resulting four time series are shown in Figure 1. For Ecully urban catchment, the Winsorising
procedure was applied to remove storm events from the time series, which is a valuable task when an
estimation of the dry weather contribution to pollutant loads measured during storm events is required
(Métadier and Bertrand-Krajewski, 2011).
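As an illustration of moving-window Winsorising, the sketch below clamps each sample to order statistics of its surrounding window. It is the textbook form of the technique, written under assumptions: the function name winsorise_moving and its parameters window and r are hypothetical, and they do not necessarily correspond to the r and m parameters of Table 1, whose exact definition is not given in enough detail here to reproduce.

```python
def winsorise_moving(values, window, r):
    """Clamp each sample to the r-th lowest / r-th highest value of its window.

    `window` is the (odd) number of samples centred on each point and `r` the
    number of order statistics trimmed on each side; both are illustrative.
    """
    if window % 2 == 0 or not 0 <= 2 * r < window:
        raise ValueError("window must be odd and 2*r smaller than window")
    half = window // 2
    out = []
    for i, x in enumerate(values):
        # Shrink the window at the series edges instead of padding.
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        ordered = sorted(values[lo:hi])
        # Never trim past the middle of a shortened edge window.
        k = min(r, (len(ordered) - 1) // 2)
        lower, upper = ordered[k], ordered[-1 - k]
        out.append(min(max(x, lower), upper))
    return out


# Hypothetical example: an isolated spike is pulled back to its neighbours' range.
series = [1.0, 1.1, 0.9, 9.0, 1.0, 1.2, 1.1]
cleaned = winsorise_moving(series, window=5, r=1)
```

With a wide window and a large r, sustained excursions such as storm events would also be clamped, which is the effect exploited for the Ecully series.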
Figure 1. Resulting time series after applying the Winsorising procedure to El Salitre WWTP (a),
GPS (b), San Fernando WWTP (c), and Ecully urban catchment (d)
Figure 2. Resulting time series for all four study sites after applying the proposed procedure to El
Salitre WWTP (a), GPS (b), San Fernando WWTP (c), and Ecully urban catchment (d)
Black and red lines in Figure 2 correspond respectively to the original time series (absorbance and
discharge) and the resulting time series in the four cases: (i) the El Salitre WWTP time series has three large
gaps: the backward filling step was used for the first gap, the forward filling step for the second gap,
and the total amount of useful and replaced values for the third gap;
(ii) the GPS time series has three large gaps, all filled with the forward filling step; (iii) the San
Fernando WWTP time series has ten gaps with 341 missing values and two gaps with 1705 missing
values, all filled with the forward filling step; and (iv) the Ecully time series does not have any
missing values; thus, the result may represent dry weather conditions for the period of time during
which the time series was analysed.
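The choice between the three filling steps can be sketched as a small decision rule. This is one reading of the description in the Materials and Methods section, not the authors' code: the function name choose_fill_step and its arguments are hypothetical, and the boundary cases (equal lengths) follow the wording "lower than" and "greater than" literally.

```python
def choose_fill_step(n_before, n_after, gap_length):
    """Pick the filling step for one gap.

    n_before / n_after: number of available values before / after the gap.
    gap_length: number of missing values in the gap.
    """
    # Forward step: half of the preceding range must cover the gap.
    if n_before / 2 >= gap_length:
        return "forward"
    # Backward step: half of the subsequent range must exceed the gap.
    if n_after / 2 > gap_length:
        return "backward"
    # Otherwise use all values from the start of the series.
    return "total"
```

For instance, with a hypothetical gap of 341 missing values preceded by 1000 available values, the rule returns "forward"; if only 400 values precede the gap but 4000 follow it, it returns "backward".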
To assess performance, continuous subsets of the discharge and absorbance time series with no outliers
and no missing values were intentionally removed from the original time series and then filled with the
proposed procedure. The mean absolute percentage error
(MAPE) was used as the performance indicator. Black and red lines in Figure 3 correspond respectively
to the original time series (absorbance and discharge) and the resulting time series. Overall, a 14 % average
error was obtained (see Table 1 for more detailed results).
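For reference, the MAPE of n removed values x_i and their replacements y_i is (100/n) · Σ |x_i − y_i| / |x_i|. The sketch below computes it and averages the four rounded MAPE values listed in Table 1; the function name mape and the variable names are hypothetical.

```python
def mape(observed, estimated):
    """Mean absolute percentage error in percent; observed values must be non-zero."""
    if not observed or len(observed) != len(estimated):
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs((o - e) / o)
                       for o, e in zip(observed, estimated)) / len(observed)


# Rounded MAPE values as listed in Table 1.
table_mape = {
    "El Salitre WWTP": 3,
    "GPS": 9,
    "San Fernando WWTP": 25,
    "Ecully urban catchment": 21,
}
average_mape = sum(table_mape.values()) / len(table_mape)  # 14.5
```

The mean of the rounded table values is 14.5 %, in line with the roughly 14 % reported in the abstract.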
Figure 3. Resulting time series for all four study sites after removing a selected range of values and
applying the proposed procedure to El Salitre WWTP (a), GPS (b), San Fernando WWTP (c), and
Ecully urban catchment (d)
CONCLUSIONS
The method was applied to four time series with different characteristics. It uses
Winsorising for outlier removal, followed by the DFT and the IFFT, using the ten
most important harmonics of the useful values, to fill the gaps of missing values in the time series
by means of the forward, backward and total-amount-of-values filling steps. To obtain the final
pre-processed time series, the DFT and IFFT were then applied to the resulting time series and the most
important 1 % of harmonics were used. The resulting r and m parameter values were specific to each
time series because each time series behaves differently. For the Ecully urban catchment, the
Winsorising process was applied to remove storm events from the time series and obtain its dry
weather behaviour. The MAPE values obtained were between 2.9 % and 25 %, with an overall average
of 14 %. The proposed process could therefore be applied to time series from different sites with different
characteristics, because it preserves the behaviour pattern of the time series preceding each gap of
missing values.
REFERENCES
Acuña, E. and Rodríguez, C. (2004). On Detection of Outliers and Their Effect in Supervised Classification. Department
of Mathematics, University of Puerto Rico at Mayaguez, Mayaguez, Puerto Rico [online]
http://academic.uprm.edu/~eacuna/vene31.pdf (Accessed October 15, 2013).
Ko, S-J. and Lee, Y. (1991). Theoretical analysis of winsorizing smoothers and their applications to image processing.
Acoustics, Speech, and Signal Processing, ICASSP-1991, 4, 3001–3004.
Langergraber, G., Fleischmann, N., Hofstaedter, F. and Weingartner, A. (2004). Monitoring of a paper mill waste water
treatment plant using UV/VIS spectroscopy. Water Science and Technology 49(1), 9–14.
Liu, H., Shah, S. and Jiang, W. (2004). On-line outlier detection and data cleaning. Computers and Chemical Engineering
28, 1635–1647.
Métadier, M. and Bertrand-Krajewski, J-L. (2011). Assessing dry weather flow contribution in TSS and COD storm
events loads in combined sewer systems. Water Science and Technology 63(12), 2983–2991.
Pearson, R. (2002). Outliers in process modelling and identification. IEEE Transactions on Control Systems Technology
10, 55–63.
Plazas-Nossa, L. and Torres, A. (2013). Fourier analysis as a forecasting tool for absorbance time series received by UV-Vis
probes installed on urban sewer systems. Proceedings of 8th International Conference Novatech 2013, Lyon,
France, 23-27 June 2013.
Proakis, J. and Manolakis, D. (2007). Digital signal processing principles, algorithms, and applications, 4th ed. New
Jersey: Pearson Prentice Hall.
s::can (2006). Manual ana::pro Version 5.3, September 2006 Release. Messtechnik GmbH, Vienna, Austria.
Tukey, J. (1977). Exploratory data analysis. Addison-Wesley, ISBN 0-201-07616-0.