
Ensemble machine learning using hydrometeorological information to improve modeling of quality parameter of raw water supplying treatment plants

Journal of Environmental Management 362 (2024) 121378
0301-4797/© 2024 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Research article
Christian Ortiz-Lopez a,*, Christian Bouchard a, Manuel J. Rodriguez b
a Centre de Recherche en Aménagement et Développement (CRAD), Université Laval, 2325 Allée des Bibliothèques, Québec City, QC, G1V 0A6, Canada
b École Supérieure d'Aménagement du Territoire et de Développement Régional (ESAD), Université Laval, 2325 Allée des Bibliothèques, Québec City, QC, G1V 0A6, Canada
ARTICLE INFO
Handling editor: Lixiao Zhang
Keywords:
Ensemble machine learning
Drinking water
Source water
Raw water quality modeling
River ow events
Rainfall events
ABSTRACT
Source and raw water quality may deteriorate due to rainfall and river flow events that occur in watersheds. The effects on raw water quality are normally detected in drinking water treatment plants (DWTPs) with a time lag after these events in the watersheds. Early warning systems (EWSs) in DWTPs require models with high accuracy in order to anticipate changes in raw water quality parameters. Ensemble machine learning (EML) techniques have recently been used for water quality modeling to improve accuracy and decrease variance in the outcomes. We used three decision-tree-based EML models (random forest [RF], gradient boosting [GB], and eXtreme Gradient Boosting [XGB]) to predict two critical parameters for DWTPs, raw water turbidity and UV absorbance (UV254), using rainfall and river flow time series as predictors. When modeling raw water turbidity, the three EML models (r² = 0.87 for RF, 0.80 for GB, and 0.81 for XGB) showed very good performance metrics. For raw water UV254, the three models (r² = 0.89 for RF, 0.85 for GB, and 0.88 for XGB) again showed very good performance metrics. Results from this study suggest that EML approaches could be used in EWSs to anticipate changes in the quality parameters of raw water and enhance decision-making in DWTPs.
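The r² values reported in the abstract can be read with a small worked example. The sketch below assumes r² means the coefficient of determination (1 minus the ratio of residual to total sum of squares); the abstract does not define it, and it could instead be the squared Pearson correlation. The function name `r_squared` and the turbidity values are hypothetical and are not data from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    return 1.0 - ss_res / ss_tot


# Hypothetical turbidity readings (NTU) and model outputs, for illustration only.
observed_ntu = [2.0, 3.0, 8.0, 15.0, 6.0]
predicted_ntu = [2.5, 3.5, 7.0, 13.0, 7.0]
score = r_squared(observed_ntu, predicted_ntu)
print(f"r2 = {score:.3f}")
```

Under this definition, a value near 1 means the predictions capture most of the variation in the observations, including the peaks that matter for treatment adjustments.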
1. Introduction
Surface water is a primary source of water supply for human consumption. Surface water contains elements such as pathogenic microorganisms, particles, and organic matter. These elements must be removed or inactivated in drinking water treatment plants (DWTPs) to deliver safe drinking water and ensure public health safety (World Health Organization, 2017). The quality of raw water from surface sources, such as rivers and lakes, is prone to variation due to meteorological events that occur in the watersheds (Khan et al., 2015). During precipitation events, the hydrological response of the watershed takes some time to develop. This response depends on specific characteristics of rainfall events such as intensity, duration, and amount, as well as soil characteristics such as soil saturation and soil conditions preceding the event. Rainfall can lead to a deterioration in the quality of raw water due to the increased transport of contaminants through surface and subsurface runoff to rivers and lakes, and thus to DWTP intakes (Delpla et al., 2023). Such deterioration in raw water quality, especially when there are large peaks in contaminant concentrations, may require prompt adjustments to treatment conditions in the DWTP. For example, coagulant and disinfectant dosages may need to be modified after rainfall events if there are significant increases in the concentrations of fine particles and natural organic matter (Edzwald, 2011).
There is a time lag between the moment rain falls in the watershed and the moment when variations in raw water quality can be detected in the DWTP through online monitoring or grab sampling (Ortiz-Lopez et al., 2023). There is an additional time lag between the detection of water quality degradation and the implementation of operational adjustments needed to respond to these situations. A tool that could anticipate raw water degradation, especially the peak concentrations, would help DWTP operators react in an appropriate and timely way. Early Warning Systems (EWSs) are critical for DWTPs and include predictive models. Deterministic (i.e., physics-based) modeling of raw water quality variations is difficult due to complex and numerous underlying phenomena (Bui et al., 2020). Some researchers therefore opt for an empirical (i.e., non-physical) approach using artificial intelligence
* Corresponding author.
E-mail address: christian.ortiz-lopez.1@ulaval.ca (C. Ortiz-Lopez).
https://doi.org/10.1016/j.jenvman.2024.121378
Received 13 February 2024; Received in revised form 3 May 2024; Accepted 2 June 2024
Article
Introduction: Waste management is one of the biggest environmental challenges facing Indonesia today. With a population of over 270 million people spread across 34 provinces, the country produces a significant amount of waste every day. Diversity in population density, economic activity, and urban development across provinces results in variations in waste production patterns and composition. Effective waste management is critical not only for environmental sustainability, but also for public health and economic development. However, the lack of waste management strategies tailored to the unique characteristics of each province often results in inefficient waste handling and disposal.
Objectives: To classify Indonesian provinces into different clusters based on waste production, volume, and composition using machine learning algorithms. Analyze the characteristics of each cluster to understand the unique challenges and opportunities in waste management they face. Provide recommendations for waste management policies tailored to the specific needs of each cluster.
Methods: The study began with data collection from official sources such as the Ministry of Environment and Forestry or the Central Statistics Agency. After the data was collected, data preprocessing was carried out to clean the data from missing values and outliers, and to normalize the data so that all variables have the same scale. Next, a clustering algorithm such as K-Means was chosen to group provinces based on their waste characteristics. The optimal number of clusters was determined using the Elbow Method.
Results: The clustering results divided the provinces into several groups. Cluster 1 contains provinces with relatively low to medium daily waste volumes. Cluster 2 includes provinces with medium waste volumes, while Cluster 3 consists of provinces with very high waste production. The majority of provinces are in clusters 1 and 2, indicating that only a few regions have major problems in waste management.
Conclusions: This study shows that clustering with K-Means can help understand waste production patterns in various provinces in Indonesia. Provinces with similar waste characteristics are grouped into three main clusters, with the majority in the low to medium waste volume category. It was found that organic waste is more dominant than inorganic waste, especially in areas with high waste production. This shows that waste management strategies based on recycling, composting, and renewable energy can be effective solutions. The results of this clustering can be used as a basis for designing more appropriate waste management policies, both in increasing waste processing capacity and encouraging community participation in reducing waste.
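The K-Means and Elbow Method steps described in this abstract can be sketched on a single variable. This is a minimal illustration under stated assumptions, not the study's implementation: the function `kmeans_1d`, its deterministic initialization, and the daily waste volumes are all hypothetical, and a real analysis would normalize several variables and typically use a library implementation.

```python
def kmeans_1d(values, k, iterations=100):
    """Cluster 1-D values into k groups; return (centroids, inertia)."""
    data = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda c: abs(v - centroids[c]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        updated = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Inertia: sum of squared distances from each point to its nearest centroid.
    inertia = sum(min((v - c) ** 2 for c in centroids) for v in data)
    return centroids, inertia


# Hypothetical daily waste volumes (tonnes/day) for nine provinces.
volumes = [120, 150, 130, 900, 950, 1000, 5000, 5200, 4800]
inertia_by_k = {k: kmeans_1d(volumes, k)[1] for k in range(1, 5)}
for k, inertia in inertia_by_k.items():
    print(k, round(inertia))
```

In this toy data the inertia falls sharply up to k = 3 and only slightly afterwards, which is the "elbow" that would suggest three clusters.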
Article
Rainfall and increased river flow can deteriorate raw water (RW) quality parameters such as turbidity and UV absorbance at 254 nm. This study aims to develop a methodology for integrating both time-lagged watershed rainfall and river flow data into machine learning models of the quality of RW supplying a drinking water treatment plant (DWTP). Spearman's rank non-parametric cross-correlation analyses were performed using both river flow and rain in the watershed and RW data from the water intake. Then, RW turbidity and RW UV254 were modelled, using a support vector regression (SVR) and an artificial neural network (ANN) under several prediction scenarios with time-lagged variables. River flow presented a very strong correlation with RW quality, whereas rainfall showed a moderate correlation. Time lags with maximum correlations between flow data and turbidity were a few hours, while for UV254, they were between 2 and 4 days, demonstrating varied time lags and a complex behaviour. The best performing scenario was the one that used time-lagged watershed rainfall and river flow as input data. ANN performed better for both turbidity and UV254 than SVR. Results from this study suggest the possibility for new modelling strategies and more accurate chemical dosing for the removal of key contaminants.
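The time-lagged Spearman cross-correlation mentioned in this abstract can be illustrated with a short sketch. It assumes equally spaced series and scans a range of lags for the one that maximizes Spearman's rho between a driver (river flow) and a response (turbidity). The helper names and the synthetic series are hypothetical; the study's actual procedure and data are not reproduced here.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed series."""
    return pearson(average_ranks(x), average_ranks(y))


def best_lag(driver, response, max_lag):
    """Return (lag, rho) maximizing rho between driver[t - lag] and response[t]."""
    results = []
    for lag in range(max_lag + 1):
        x = driver[: len(driver) - lag]
        y = response[lag:]
        results.append((lag, spearman(x, y)))
    return max(results, key=lambda item: item[1])


# Synthetic hourly series: turbidity echoes river flow three steps later.
flow = [10, 12, 30, 55, 40, 25, 18, 14, 12, 11, 35, 60, 45, 20, 13, 11]
turbidity = [2.0, 2.0, 2.0] + [0.1 * q + 1.0 for q in flow[:-3]]
lag, rho = best_lag(flow, turbidity, max_lag=6)
print(lag, round(rho, 3))
```

The lag found this way is the kind of quantity that could then be used to shift rainfall and flow inputs before feeding them to a model.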
Article
Study region: Bisham Qilla and Doyian stations, Indus River Basin of Pakistan.
Study focus: Water pollution is an international concern that impedes human health, ecological sustainability, and agricultural output. This study focuses on the distinguishing characteristics of evolutionary and ensemble machine learning (ML) based modeling to provide in-depth insight into escalating water quality problems. The 360 temporal readings of electric conductivity (EC) and total dissolved solids (TDS), with several input variables, are used to establish a multi-expression programming (MEP) model and a random forest (RF) regression model for the assessment of water quality in the Indus River.
New hydrological insight for the region: The developed models were evaluated using several statistical metrics. The findings reveal that the determination coefficient (R2) in the testing phase (subject to unseen data) for all the developed models is more than 0.95, indicating the accuracy of the developed models. Furthermore, the error measurements are much lower, with the root mean square logarithmic error (RMSLE) nearly equal to zero for each developed model. The mean absolute percent error (MAPE) of the MEP models and RF models falls below 10% and 5%, respectively, in all three phases (training, validation and testing). The sensitivity study of the generated MEP models, which assessed the relevance of inputs to the predicted EC and TDS, shows that bi-carbonates and chlorine content have significant influence, with a sensitivity score of more than 0.90, whereas the impact of sodium content is less pronounced. All the models (RF and MEP) have lower uncertainty based on the prediction interval coverage probability (PICP) calculated using the quartile regression (QR) approach. The PICP% of each model is greater than 85% in all three stages. Thus, the findings of the study indicate that developing intelligent models for water quality parameters is cost effective and feasible for monitoring and analyzing the Indus River water quality.
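The error and uncertainty metrics named in this abstract (MAPE, RMSLE, PICP) can be written out in a few lines. The sketch below uses their common textbook definitions, which are assumed here to match the study's; the function names, the TDS values, and the interval bounds are hypothetical.

```python
import math


def mape(observed, predicted):
    """Mean absolute percent error, in percent (observed values must be non-zero)."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / len(observed)


def rmsle(observed, predicted):
    """Root mean square logarithmic error, using log(1 + x) on both series."""
    squared = [(math.log1p(p) - math.log1p(o)) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def picp(observed, lower, upper):
    """Prediction interval coverage probability: percent of observations inside [lower, upper]."""
    inside = sum(1 for o, lo, hi in zip(observed, lower, upper) if lo <= o <= hi)
    return 100.0 * inside / len(observed)


# Hypothetical TDS observations (mg/L), point predictions, and interval bounds.
observed_tds = [100.0, 200.0, 400.0, 500.0]
predicted_tds = [110.0, 190.0, 400.0, 450.0]
lower_bound = [90.0, 195.0, 380.0, 420.0]
upper_bound = [120.0, 230.0, 420.0, 480.0]

print(mape(observed_tds, predicted_tds))
print(rmsle(observed_tds, predicted_tds))
print(picp(observed_tds, lower_bound, upper_bound))
```

A high PICP only indicates that the intervals cover the observations; it says nothing about how wide the intervals are, so it is usually read alongside an interval-width measure.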
Article
The energetic nature of these important water resources makes them the most vulnerable to contamination from additional waste from multiple sources. Water quality monitoring is critical to water environmental management, and successful monitoring provides direction and confirms the effectiveness of water management. Models based on artificial intelligence are fundamental for anticipating appropriate mitigation measures for surface water quality. Even so, this remains a challenge, and accuracy still needs to improve. Faster and cheaper control is required due to the real-world impact of low water quality. With this motivation, this research examines an array of machine learning algorithms to estimate water quality. The proposed approach uses Random Forest for modeling and is also useful for predicting surface water quality in the Kulik geographic region of West Bengal, India. It is a good tool for assessing the quality and ensuring the safe use of drinking water. Various water quality parameters (iron, fluoride, total coliform, fecal coliform, pH, total dissolved solids, magnesium, alkalinity, chloride, total hardness, nitrate, calcium, and Escherichia coli) were measured seasonally (winter, summer, rain) over 10 years (2010–2019). The estimated water quality parameters in this study were total dissolved solids (TDS), pH, and iron.
HIGHLIGHTS
Most people in north Bengal depend on the Kulik River for multiple purposes such as settlement, cultivation, irrigation, fishing and various primary activities, so there is a need for water quality monitoring and management of the Kulik River.
Analysis and prediction of 13 parameters will be helpful for society.
The proposed approach used Random Forest for modeling and assessing the water quality.
Article
Modelling source water quality in drinking water treatment systems could be useful for anticipating changes in specific raw water quality parameters. Those changes entail adjustments in drinking water treatment plant (DWTP) operations. Artificial intelligence (AI) has been used for modelling water quality for different purposes and has yielded reliable results. However, there has not yet been wide investigation of raw water quality modelling for treatment purposes using AI. For the first time, in this critical review, we analyzed AI models founded on machine learning techniques that are used for surface water quality modelling and which could be applied in the domain of source water treatment. In a novel approach, we convened an expert panel that helped us define the appropriate criteria for use in the selection of the papers for review. We analysed the selected papers according to several criteria, including the feasibility of input data generation, the modelled data applicability and the benefits and limitations. We evaluated whether the selected models could be applied to forecast raw water quality as decision support systems (DSS) in drinking water treatment. The highest rated were hourly turbidity models based on Support Vector Machines (SVM), as well as daily turbidity and pH models based on Artificial Neural Networks (ANN). We found there is a shortage of models used to specifically estimate raw water quality, which could assist in DSS at DWTPs. There should be an increased effort to model raw water quality, especially with AI models using hourly and sub-hourly time steps.
Article
Machine learning (ML) is increasingly used in environmental research to process large data sets and decipher complex relationships between system variables. However, due to the lack of familiarity and methodological rigor, inadequate ML studies may lead to spurious conclusions. In this study, we synthesized literature analysis with our own experience and provided a tutorial-like compilation of common pitfalls along with best practice guidelines for environmental ML research. We identified more than 30 key items and provided evidence-based data analysis based on 148 highly cited research articles to exhibit the misconceptions of terminologies, proper sample size and feature size, data enrichment and feature selection, randomness assessment, data leakage management, data splitting, method selection and comparison, model optimization and evaluation, and model explainability and causality. By analyzing good examples on supervised learning and reference modeling paradigms, we hope to help researchers adopt more rigorous data preprocessing and model development standards for more accurate, robust, and practicable model uses in environmental research and applications.
Article
Heavy rainfall events can lead to the runoff of large amounts of dissolved and particulate matter into surface water sources that may represent challenges for drinking water treatment, such as membrane fouling, increases in chemical demands, and formation of various disinfection by-products (DBPs) after disinfection, such as trihalomethanes (THM) and haloacetic acids (HAA). In this study, a framework is defined for analyzing water quality data in relation to climatic variables (rainfall). The effects of 22 different rain events were assessed on an organic matter proxy (UV absorbance), and on different key water quality parameters for the coagulation step in a drinking water treatment plant. Extended impacts of rewetting events after long-term dry periods on source water quality were identified, with significant increases in raw water UV 254 nm that last almost 3 weeks. A significant effect on filtered water quality was also noticed, and the potential impacts on finished water quality were confirmed by HAA modelling results. Future studies could focus on the monitoring and modelling of other regulated DBPs such as THM as well as simulations of different scenarios of climate change to estimate the variability of DBPs and their precursors such as organic matter.
Article
One of the most crucial jobs to improve water resources management plans is the assessment of river water quality. A water quality index (WQI) takes multiple water quality factors into account simultaneously. Traditionally, derivations of sub-indices for WQI computations take a long time and are frequently rife with errors. The adoption of reliable and effective machine learning (ML) algorithms has become essential for predicting the WQI of such a matrix. This study predicts WQI, i.e., total dissolved solids (TDS) and electrical conductivity (EC), using ML techniques, including individual learners in conjunction with ensemble learners (bagging and boosting). Anaconda (Python) is utilized to accomplish this. Weak ensemble learners are incorporated to create a strong ensemble learner using an adaptive boosting technique, ensemble learner bagging, and random forest (RF) as a modified bagging method. The ensemble learners are employed on weak or individual learners, which include multi-layer perceptron neural networks (MLPNN), support vector machines (SVM), and decision trees (DT) using regression. The data comprises 372 data readings collected on a monthly basis and eight characteristics to forecast the results. Twenty boosting and bagging sub-models were trained on the collected data readings, and they were then optimized to produce the highest R². Additionally, K-Fold cross-validation with R², RMSE, and MAE is used to validate the testing data. Furthermore, a statistical model performance index is used to compare ensemble models to individual ones (e.g., MAE, RMSE, NSE, MSE, and RMLSE). The outcome revealed that using the boosting and bagging learners improves the response of individual models. RF, with an R² of 0.958 and 0.964 (TDS and EC), and DT, with bagging having an R² of 0.954 and 0.961 (TDS and EC), reported the fewest errors and provided the most reliable and precise performance of the models. 
In general, the ML ensemble model would improve the performance of models.
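The bagging idea referred to in this abstract (training weak learners on bootstrap samples and averaging them) can be sketched without any external library. The example below bags one-split regression trees ("stumps") on a single predictor. It is a simplified illustration, not the study's pipeline and not a full random forest, which would also subsample features and grow deeper trees. All function names and data values are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing the squared error."""
    best = None
    candidates = sorted(set(xs))
    for left, right in zip(candidates, candidates[1:]):
        threshold = (left + right) / 2
        lo = [y for x, y in zip(xs, ys) if x <= threshold]
        hi = [y for x, y in zip(xs, ys) if x > threshold]
        lo_mean, hi_mean = sum(lo) / len(lo), sum(hi) / len(hi)
        sse = sum((y - lo_mean) ** 2 for y in lo) + sum((y - hi_mean) ** 2 for y in hi)
        if best is None or sse < best[0]:
            best = (sse, threshold, lo_mean, hi_mean)
    if best is None:  # all x values identical: predict the mean everywhere
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]  # (threshold, lo_mean, hi_mean)


def predict_stump(stump, x):
    threshold, lo_mean, hi_mean = stump
    return lo_mean if x <= threshold else hi_mean


def fit_bagged_stumps(xs, ys, n_estimators, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps


def predict_bagged(stumps, x):
    """Aggregate by averaging the individual stump predictions."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Hypothetical river flow (m3/s) and raw water turbidity (NTU) pairs.
flow = [5.0, 8.0, 10.0, 12.0, 20.0, 25.0, 30.0, 40.0]
turbidity = [1.0, 1.1, 1.2, 1.5, 4.0, 5.0, 6.5, 9.0]
ensemble = fit_bagged_stumps(flow, turbidity, n_estimators=50, seed=42)
low = predict_bagged(ensemble, 3.0)
high = predict_bagged(ensemble, 45.0)
print(low, high)
```

Because each stump sees a different resample, the averaged prediction changes more smoothly than any single stump would, which is the variance reduction that bagging is intended to provide.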
Article
The accurate estimation of coastal water quality parameters (WQPs) is crucial for decision-makers to manage water resources. Although various machine learning (ML) models have been developed for coastal water quality estimation using remote sensing data, the performance of these models has significant uncertainties when applied to regional scales. To address this issue, an ensemble ML-based model was developed in this study. The ensemble ML model was applied to estimate chlorophyll-a (Chla), turbidity, and dissolved oxygen (DO) based on Sentinel-2 satellite images in Shenzhen Bay, China. The optimal input features for each WQP were selected from eight spectral bands and seven spectral indices. A local explanation strategy termed Shapley Additive Explanations (SHAP) was employed to quantify contributions of each feature to model outputs. In addition, the impacts of three climate factors on the variation of each WQP were analyzed. The results suggested that the ensemble ML models have satisfactory performance for Chla (errors = 1.7%), turbidity (errors = 1.5%) and DO estimation (errors = 0.02%). Band 3 (B3) has the highest positive contribution to Chla estimation, while Band Ratio Index 2 (BR2) has the highest negative contribution to turbidity estimation, and Band 7 (B7) has the highest positive contribution to DO estimation. The spatial patterns of the three WQPs revealed that the water quality deterioration in Shenzhen Bay was mainly influenced by input of terrestrial pollutants from the estuary. Correlation analysis demonstrated that air temperature (Temp) and average air pressure (AAP) exhibited the closest relationship with Chla. DO showed the strongest negative correlation with Temp, while turbidity was not sensitive to Temp, average wind speed (AWS), and AAP. Overall, the ensemble ML model proposed in this study provides an accurate and practical method for long-term Chla, turbidity, and DO estimation in coastal waters.
Article
Algal bloom is a significant issue when managing water quality in freshwater; specifically, predicting the concentration of algae is essential to maintaining the safety of the drinking water supply system. The chlorophyll-a (Chl-a) concentration is a commonly used indicator to obtain an estimation of algal concentration. In this study, an XGBoost ensemble machine learning (ML) model was developed from eighteen input variables to predict Chl-a concentration. The composition and pretreatment of input variables to the model are important factors for improving model performance. Explainable artificial intelligence (XAI) is an emerging area of ML modeling that provides a reasonable interpretation of model performance. The effect of input variable selection on model performance was estimated, where the priority of input variable selection was determined using three indices: Shapley value (SHAP), feature importance (FI), and variance inflation factor (VIF). SHAP analysis is an XAI algorithm designed to compute the relative importance of input variables with consistency, providing an interpretable analysis for model prediction. The XGB models simulated with independent variables selected using three indices were evaluated with root mean square error (RMSE), RMSE-observation standard deviation ratio, and Nash-Sutcliffe efficiency. This study shows that the model exhibited the most stable performance when the priority of input variables was determined by SHAP. This implies that on-site monitoring can be designed to collect the selected input variables from the SHAP analysis to reduce the cost of overall water quality analysis. The independent variables were further analyzed using SHAP summary plot, force plot, target plot, and partial dependency plot to provide understandable interpretation on the performance of the XGB model. 
While XAI is still in the early stages of development, this study successfully demonstrated a good example of XAI application to improve the interpretation of machine learning model performance in predicting water quality.
Article
Rapid changes in microbial water quality in surface waters pose challenges for production of safe drinking water. If not treated to an acceptable level, microbial pathogens present in the drinking water can result in severe consequences for public health. The aim of this paper was to evaluate the suitability of data-driven models of different complexity for predicting the concentrations of E. coli in the river Göta älv at the water intake of the drinking water treatment plant in Gothenburg, Sweden. The objectives were to (i) assess how the complexity of the model affects the model performance; and (ii) identify relevant factors and assess their effect as predictors of E. coli levels. To forecast E. coli levels one day ahead, the data on laboratory measurements of E. coli and total coliforms, Colifast measurements of E. coli, water temperature, turbidity, precipitation, and water flow were used. The baseline approaches included Exponential Smoothing and ARIMA (Autoregressive Integrated Moving Average), which are commonly used univariate methods, and a naive baseline that used the previous observed value as its next prediction. Also, models common in the machine learning domain were included: LASSO (Least Absolute Shrinkage and Selection Operator) Regression and Random Forest, and a tool for optimising machine learning pipelines – TPOT (Tree-based Pipeline Optimization Tool). Also, a multivariate autoregressive model VAR (Vector Autoregression) was included. The models that included multiple predictors performed better than univariate models. Random Forest and TPOT resulted in higher performance but showed a tendency of overfitting. Water temperature, microbial concentrations upstream and at the water intake, and precipitation upstream were shown to be important predictors. 
Data-driven modelling enables water producers to interpret the measurements in the context of what concentrations can be expected based on the recent historic data, and thus identify unexplained deviations warranting further investigation of their origin.
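Two of the univariate baselines named in this abstract, the naive forecast (the previous observed value is the next prediction) and simple exponential smoothing, are easy to write out. The sketch below is a minimal illustration under stated assumptions: the smoothing factor `alpha = 0.5`, the function names, and the E. coli series are hypothetical, and the study's own configuration is not known from the abstract.

```python
def naive_forecast(series):
    """One-step-ahead forecasts for series[1:]: each is the previous observation."""
    return series[:-1]


def exponential_smoothing_forecast(series, alpha):
    """Simple exponential smoothing; one-step-ahead forecasts for series[1:]."""
    level = series[0]
    forecasts = []
    for value in series[1:]:
        forecasts.append(level)  # forecast is made before the new value is seen
        level = alpha * value + (1 - alpha) * level
    return forecasts


def mean_absolute_error(observed, predicted):
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical daily E. coli concentrations (CFU/100 mL), for illustration only.
ecoli = [10.0, 12.0, 11.0, 30.0, 28.0, 15.0]
targets = ecoli[1:]

naive = naive_forecast(ecoli)
ses = exponential_smoothing_forecast(ecoli, alpha=0.5)
naive_mae = mean_absolute_error(targets, naive)
ses_mae = mean_absolute_error(targets, ses)
print(naive_mae, ses_mae)
```

Baselines like these give a reference error level: a more complex model is only worth its added complexity if it clearly beats them on held-out data.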