Journal of Cleaner Production 416 (2023) 137885
Available online 28 June 2023
0959-6526/© 2023 Elsevier Ltd. All rights reserved.
A multi-model data fusion methodology for reservoir water quality based on machine learning algorithms and Bayesian maximum entropy
Mohammad G. Zamani a, Mohammad Reza Nikoo b,*, Fereshteh Niknazar a, Ghazi Al-Rawas b, Malik Al-Wardy c, Amir H. Gandomi d,e
a Department of Water Resources Engineering, Amirkabir University of Technology, Tehran, Iran
b Department of Civil and Architectural Engineering, Sultan Qaboos University, Muscat, Oman
c Department of Soils, Water, and Agricultural Engineering, Sultan Qaboos University, Muscat, Oman
d Department of Engineering and IT, University of Technology Sydney, Ultimo, NSW, 2007, Australia
e University Research and Innovation Center (EKIK), Óbuda University, 1034, Budapest, Hungary
ARTICLE INFO
Handling Editor: Jin-Kuk Kim
Keywords:
Water quality estimation
Bayesian maximum entropy fusion (BMEF)
Fusion model
Machine learning models
ABSTRACT
Water quality is a major concern in reservoir management because of its consequences for both the environment and human life. Artificial intelligence (AI) provides a reliable framework for recognizing complicated, non-linear relationships between input and output data. Although recent studies have employed various machine learning (ML) algorithms to predict water quality variables, the existing literature has not explored combining these algorithms, which has the potential to significantly improve on the outcomes achieved by individual models. The current study aims to bridge this knowledge gap by evaluating the precision of Random Forest Regression (RFR), Support Vector Regression (SVR), Multilayer Perceptron (MLP), and Bayesian Maximum Entropy-based Fusion (BMEF) models in estimating water quality variables such as dissolved oxygen (DO) and chlorophyll-a (Chl-a). The comparisons were conducted in two primary stages: (1) a comparison of the outcomes of the different ML algorithms with each other, and (2) a comparison of the ML algorithms' findings with those of the BMEF model, which considers uncertainty. These comparisons were evaluated using robust statistical measures, and, to demonstrate the utility and efficacy of the proposed framework, it was applied to Wadi Dayqah Dam in Oman. The findings indicated that, throughout both the training and testing phases, the BMEF model outperformed the individual machine learning models, namely MLP, RFR, and SVR, by 5%, 26%, and 10%, respectively, when R2 is taken as the evaluation index and Chl-a as the water quality variable. Additionally, because the individual ML models cannot predict electrical conductivity and oxidation-reduction potential efficiently, the BMEF model leads to better results, with R2 = 0.89 for oxidation-reduction potential, compared with MLP (R2 = 0.81), RFR (R2 = 0.79), and SVR (R2 = 0.62).
Regarding the limitations of the present study, spatio-temporal data should be collected over a long period to increase the data frequency and reduce the uncertainty related to climate variability.
* Corresponding author. Department of Civil and Architectural Engineering, Sultan Qaboos University, Muscat, Oman.
E-mail address: [email protected] (M.R. Nikoo).
https://doi.org/10.1016/j.jclepro.2023.137885
Received 28 April 2023; Received in revised form 9 June 2023; Accepted 20 June 2023
1. Introduction
Water quality monitoring is of paramount importance in the effective management of water resources, serving as an indispensable instrument for decision-makers and stakeholders (Behmel et al., 2016). By providing precise and measurable data, water quality monitoring facilitates the efficient and sustainable management of both ongoing and emerging environmental dilemmas (Alcamo, 2019; Jahanshahi and Kerachian, 2019). Furthermore, water quality monitoring involves the meticulous collection, precise measurement, and comprehensive analysis of water samples (Kelly et al., 2020). Its primary objective is to discern the complex physicochemical and biological characteristics of a specific water body. Dissolved oxygen (DO) and chlorophyll-a (Chl-a) are crucial water quality indicators used in monitoring various water bodies, especially reservoirs (Olds et al., 2011; Mazhar et al., 2023). The DO concentration in reservoirs reflects the balance between the processes that produce and consume oxygen, including chemical oxidation (Chen and Liu, 2014). Chl-a acts as an indicator of algae concentration in reservoirs, thereby providing valuable insights into eutrophication. Hence, monitoring these water quality variables is imperative to ensure the sustainable management of reservoirs (Kim and Ahn, 2022).
Over the past few years, researchers have focused on enhancing water quality management by creating models that effectively simulate the intricate chemical, biological, and physical processes associated with a wide variety of contaminants (Tang et al., 2019). In pursuit of this goal, a multitude of models have been used to evaluate water quality variables; these can be classified as process-based methods (Wang et al., 2016; Noori et al., 2020), statistical methods (Avila et al., 2018; Kumar et al., 2021), and machine learning-based algorithms (Shah et al., 2021a,b; Zamani et al., 2023). Process-based methods have been extensively employed to estimate water quality status in various water bodies (Chen et al., 2021). Nevertheless, these methods are often sophisticated and entail high computational costs, particularly when dealing with vast watersheds (Singh et al., 2021). Furthermore, because statistical methods are often derived from linear correlation coefficients, they tend to be less adept at capturing and handling complex relationships between variables (Chakraborty et al., 2023). As a result, machine learning applications have emerged as prominent solutions in the literature, addressing the limitations of process-based and statistical models (Javed et al., 2020).
To illustrate how ML algorithms have been applied to water quality variables, several recently published analyses are summarized below. For example, Park et al. (2014) applied support vector machine (SVM) and artificial neural network (ANN) models to forecast Chl-a concentration. Their results demonstrated that the SVM model predicts Chl-a concentration more accurately than the ANN model. Sarkar and Pandey (2015) used the ANN technique to model dissolved oxygen levels; the model's evaluation yielded an R-Squared of 0.9. In assessing dissolved oxygen concentration, Jafari et al. (2020) employed both regression and neural network models, as well as their combination with a wavelet function. The authors reported that the neural network model demonstrated superior estimation capacity compared to the other approaches. Haghiabi et al. (2018) developed two machine learning algorithms, SVM and ANN, to predict water quality variables, and found that SVM performed better in estimating water quality characteristics. Valera et al. (2020) used ML algorithms to estimate dissolved oxygen concentrations in both nearshore and offshore areas. Their study indicated that Random Forest Regression (RFR) consistently performed slightly better than Support Vector Regression (SVR), which they attributed to the SVR model's higher tuning complexity and longer training time. Kouadri et al. (2021) developed eight artificial intelligence (AI)-based models to predict water quality parameters (WQPs). Their findings indicated that the MLP model achieved higher accuracy on the performance assessment metric than the other developed models. Kim and Ahn (2022) developed a Chl-a concentration predictive model using data collected from various stations located in the Han River basin. The study incorporated three machine learning algorithms, namely RF, SVR, and ANN. Alqahtani et al. (2022) developed standalone ML algorithms as well as their combination to predict WQPs such as EC and total dissolved solids. They considered artificial neural network, gene expression programming, and random forest in their analysis, and their findings demonstrated the comparative advantage of RF over the alternative models. Carvalho et al. (2022) examined the capabilities of machine learning models in predicting Chl-a levels in reservoirs while also exploring their relationship with hydrological and climate variables. Their findings offered valuable perspectives on the potential impact of seasonal climate prediction and reservoir operation on water quality in areas that rely heavily on surface reservoirs.
To enhance the accuracy of a simulation model, data fusion may be used to integrate or combine the results of several models that rely on one data source (Behboudian et al., 2021). It offers the chance to draw further findings that are impossible to obtain when using only data from a single source. Alqahtani et al. (2022) developed standalone ML algorithms and an integrated model to predict electrical conductivity and total dissolved solids, considering random forest, gene expression programming, and artificial neural network. To enhance the precision of Chl-a forecasting, a newly developed method was introduced by Zhang et al. (2023). Given the high performance of fusion models, the present study uses Bayesian Maximum Entropy-based Fusion (BMEF) to combine the outcomes of three estimation models and thereby improve on the results obtained from the individual algorithms. The literature contains various effective BMEF applications in water resources management. For example, Hosseini and Kerachian (2017) used this procedure to examine the spatiotemporal fluctuations of water table levels and to optimize a groundwater table monitoring network. Alizadeh and Mahjouri (2017) used the BMEF to enhance the spatial and temporal monitoring frequencies within sparsely monitored aquifers. Han et al. (2020) used the BME framework to enhance the precision and reliability of soil moisture estimates. Ghazipour and Mahjouri (2022) introduced an innovative BMEF-based methodology for combining the predictions generated by four different individual machine learning (ML) models.
Abbreviations
AI: Artificial Intelligence
ANN: Artificial Neural Networks
BMEF: Bayesian Maximum Entropy-based Fusion
Cftools: Curve Fitting Tool
Chl-a: Chlorophyll-a
CTD: Conductivity, temperature, and depth
C–V: Cross-validation
DO: Dissolved oxygen
EC: Electrical Conductivity
EI: Evaluation indices
FM: Fusion model
HPO: Hyper-parameter optimization
Lat: Latitude
Long: Longitude
LOGSIG: Logarithmic sigmoid
MAE: Mean absolute error
ML: Machine Learning
MLP: Multilayer Perceptron
MLP-ANNs: Multilayer Perceptron Artificial Neural Networks
NSE: Nash-Sutcliffe efficiency coefficient
ORP: Oxidation reduction potential
PMF: Probability mass function
PURELIN: Pure linear transfer function
RFR: Random Forest Regression
R2: Coefficient of determination
RMSE: Root mean square error
SVR: Support Vector Regression
T: Temperature
TANSIG: Hyperbolic tangent sigmoid
WDD: Wadi Dayqah Dam
WQVs: Water Quality Variables

It has been noted that combining machine learning algorithms can help offer precise prediction and estimation models (Chobar et al., 2022). Moreover, the literature indicates that the BMEF concept has not yet been used to enhance the output of estimation models for WQPs.
Additionally, the water quality data used in previous studies are usually measured at a single point and depth over time, and the data quality at different coordinates and depths of such systems has not been measured with high accuracy. Therefore, to improve the results, the current study develops fusion modeling with the BMEF to combine the outcomes of distinct machine learning models. In other words, the objective of the proposed model is to merge the results of several estimation models in order to make use of each model's strengths. Accordingly, the closer a model's prediction is to the corresponding observation, the greater its proportional weight in the fusion process.
The current research introduces an innovative approach for evaluating the efficacy of individual machine learning algorithms, including MLP, SVR, and RF models, and for comparing them with the BMEF technique. The primary objectives of the proposed methodology were to:
1. Assess the predictive ability of individual ML models to estimate dissolved oxygen and chlorophyll-a in various depths and different locations (latitude and longitude).
2. Estimate water quality variables using the BMEF-based fusion model.
3. Evaluate individual machine learning models versus the BMEF model.
4. Take into account the parameter specification and error assessment standards applied to the developed machine learning models and BMEF model.
5. Estimate Chl-a and DO in those locations and depths which have not been measured by the CTD (AAQ-RINKO) sensor.
2. Methodology
To assess the accuracy of data-driven approaches for estimating WQPs across diverse locations (latitude, longitude, and depth), three types of machine learning algorithms were employed to estimate chlorophyll-a (Chl-a) and dissolved oxygen (DO), measurements of which were collected using a CTD (AAQ-RINKO) sensor. Fig. 1 illustrates the flow diagram of the suggested approach for estimating the aforementioned water quality parameters. The framework encompassed the following phases.
1. Data Collection and statistical preprocessing
In the current study, the water quality parameters used to develop the models were collected with a CTD (AAQ-RINKO) sensor on January 15, 2023 at different locations and depths. The sensor was used to measure physicochemical parameters such as electrical conductivity, oxidation-reduction potential, pH, water temperature, dissolved oxygen, and chlorophyll-a in Wadi Dayqah Dam, located in Oman. The obtained data must be preprocessed before they are used in simulations; to prepare the data for the simulation process, non-parametric tests were applied.
2. Choosing the most influential predictors (input variables)
The most effective predictors among the candidate variables (latitude, longitude, depth, temperature, electrical conductivity, pH, and oxidation-reduction potential) are taken as the model's inputs for estimating WQPs. Latitude, longitude, and depth are included in all scenarios because of the goal of the present study. The selection is made using Pearson's correlation coefficient, choosing the variables with the strongest links to the WQP variables.
3. Developing machine learning algorithms
In this phase of the suggested technique, individual estimates of the WQPs, namely chlorophyll-a and dissolved oxygen, are produced using the calibrated individual models. These machine learning algorithms were developed based on MLP, RFR, and SVR.
4. Developing model fusion
To enhance the outcomes obtained from the individual machine learning models, a model fusion strategy based on Bayesian Maximum Entropy Fusion (BMEF) was developed.
5. Comparing the results obtained from the individual and fusion models
To demonstrate that the BMEF model outperforms each separate machine learning model, results based on several evaluation indices were compared.
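The predictor-selection phase (phase 2) can be illustrated with a minimal sketch. It ranks candidate predictors by the absolute value of their Pearson correlation with a target variable. The function names and the toy numbers are hypothetical and are not taken from the study's dataset; the study does not state the threshold or exact procedure it used.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def rank_predictors(candidates, target):
    """Return (name, r) pairs sorted by |r|, strongest link first."""
    scores = [(name, pearson_r(values, target)) for name, values in candidates.items()]
    return sorted(scores, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical toy values for illustration only (not measurements from the study).
candidates = {
    "T": [24.0, 23.5, 22.0, 21.0, 20.5],
    "pH": [8.1, 8.0, 8.2, 7.9, 8.1],
    "EC": [610.0, 612.0, 618.0, 625.0, 630.0],
}
do_values = [7.9, 7.6, 6.8, 6.1, 5.9]

ranking = rank_predictors(candidates, do_values)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

Ranking by the absolute value keeps strongly negative relationships, which are as informative for a regression model as strongly positive ones.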
2.1. Multilayer perceptron neural network (MLP-ANN)
Artificial Neural Networks (ANNs) are a valuable and effective tool in machine learning for researchers and practitioners (Alizadeh and Kavianpour, 2015). They excel at predicting outputs and adapt to the distinct characteristics of each problem encountered (Yadav et al., 2014). Drawing inspiration from the human brain, ANNs can accurately model a wide range of highly intricate, non-linear problems.
In situations where the data depart from well-defined mathematical functions, this method becomes instrumental in resolving the given problem. ANNs incorporate three types of layers: the input layer, which holds the input variables; one or more hidden layers, which house numerous neurons; and the output layer, which holds the output variable(s) (Jain et al., 1996; Rana et al., 2018; Ehsani et al., 2022). MLP-ANNs are a prevalent and pragmatic variant of ANNs in which every neuron is connected to the neurons of the neighboring layers (Tao et al., 2023). Fig. 2 shows a diagram illustrating how a neuron functions in MLP-ANNs.
In Fig. 2, X1 to Xn represent the first to nth input variables, W1 to Wn represent the weights associated with those inputs, b represents a constant (the bias), f represents a transfer function, and z represents the output of the neuron. Eq. (1) and Eq. (2) give the net and z values (Wang et al., 2021):
$$\mathrm{net} = b + \sum_{i=1}^{n} W_i X_i \tag{1}$$

$$z = f(\mathrm{net}) \tag{2}$$
The weights, Wi, and the constant, b, within the neurons are determined by the optimization method employed in training the ANN. Fig. 3 illustrates the structure of the MLP-ANNs used in this study.
In this study, Latitude (Lat), Longitude (Long), Depth, pH, Temperature (T), and Electrical Conductivity (EC) were the input variables, and dissolved oxygen and chlorophyll-a were the output variables of the MLP-ANNs.
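As a small illustration of Eq. (1) and Eq. (2), the sketch below evaluates a single neuron. The hyperbolic tangent transfer function is chosen here only as an example (it corresponds to TANSIG in the abbreviation list); the inputs, weights, and bias are hypothetical values, not parameters from the trained networks.

```python
import math

def tansig(net):
    """Hyperbolic tangent sigmoid transfer function."""
    return math.tanh(net)

def neuron_output(inputs, weights, bias, transfer=tansig):
    """Compute z = f(net) with net = b + sum_i W_i * X_i for one neuron."""
    net = bias + sum(w * x for w, x in zip(weights, inputs))
    return transfer(net)

# Hypothetical scaled inputs, weights, and bias for illustration only.
x = [0.2, -0.4, 0.7]
w = [0.5, 0.1, -0.3]
b = 0.05

z = neuron_output(x, w, b)
print(f"z = {z:.4f}")
```

Passing a different callable as `transfer` (for example an identity function for a linear output neuron) changes only the second step; the weighted sum in Eq. (1) stays the same.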
2.2. Random Forest Regression (RFR)
The Random Forest approach uses many decision trees and a set of predictors to forecast and estimate outcomes (Desai and Ouarda, 2021). It obtains its predictions by constructing decision trees from a training dataset and, in a regression framework, taking the mean forecast of the trees (Ali et al., 2012). RF builds numerous decision trees, each grown independently from a bootstrap sample of the training dataset (Fig. 4). The bootstrap sampling approach divides the data into homogeneous subsets: each tree is generated and trained on a randomly selected subset of the samples, and its validity and accuracy are then determined using the remaining samples (Biau, 2012; Liu et al., 2012; Shah et al., 2022).
Fig. 1. Illustrative depiction of the flowchart outlining the suggested approach.
In the early comparison, when the only goal was to choose the best model, the grid search approach was employed to tune the model hyper-parameters. Table 4 shows that decision tree-based models such as RFR, ABR, and GBR perform satisfactorily. The RFR model, a non-parametric technique, was chosen to simulate the association between the prospective predictors (watershed properties) and the target variables (water quality) because, according to R-Squared, it performed best (Table 4).
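The bootstrap-and-average idea behind RFR can be sketched in a few lines. This is a deliberately simplified illustration, not the study's implementation: it uses one-split trees (stumps) on a single predictor and omits the random feature selection of a full random forest. All names and the toy depth/DO values are hypothetical.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree; returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: always predict the sample mean
        mean = sum(ys) / len(ys)
        return (float("inf"), mean, mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Grow each stump on a bootstrap sample; keep its out-of-bag indices."""
    rng = random.Random(seed)
    n = len(xs)
    ensemble = []
    for _ in range(n_trees):
        in_bag = [rng.randrange(n) for _ in range(n)]
        out_of_bag = sorted(set(range(n)) - set(in_bag))
        stump = fit_stump([xs[i] for i in in_bag], [ys[i] for i in in_bag])
        ensemble.append((stump, out_of_bag))
    return ensemble

def predict_ensemble(ensemble, x):
    """Regression forecast: the mean of the individual tree forecasts."""
    preds = [predict_stump(stump, x) for stump, _ in ensemble]
    return sum(preds) / len(preds)

# Hypothetical depth (m) and DO (mg/L) values for illustration only.
depth = [0, 5, 10, 15, 20, 25, 30, 35]
do = [8.0, 7.8, 7.5, 6.0, 4.0, 3.0, 2.5, 2.2]

ensemble = fit_bagged_stumps(depth, do)
shallow = predict_ensemble(ensemble, 2)
deep = predict_ensemble(ensemble, 33)
print(f"DO at 2 m: {shallow:.2f}, DO at 33 m: {deep:.2f}")
```

The out-of-bag indices stored with each tree are the "remaining samples" on which that tree's accuracy could be checked without touching its training data.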
2.3. Support vector regression (SVR)
Fig. 2. Graphic illustration demonstrating the functionality of neurons in MLP-ANNs.
Fig. 3. The structure of MLP-ANNs (Taki et al., 2018).
Fig. 4. Architecture of the RFR model in the current study.
The fundamental SVM model used in this study was originally popularized by Vapnik and his colleagues (Cortes and Vapnik, 1995). In this procedure, the data in the input space are transformed using an appropriate kernel function. In SVM regression, the objective is to determine the most favorable hyperplane, that is, the one that minimizes the distance to the data points (Singh et al., 2014). To illustrate SVR, consider a straightforward linear regression problem trained on a dataset {ui, vi}, where i ranges from 1 to n, ui are the input vectors, and vi are the corresponding target values. The function g(u) is then formulated to capture the underlying relationships within the dataset. In standard SVM regression, a loss function quantifies the disparity between the estimated function and the original data. Here, Vapnik's standard ε-insensitive loss function Lε(v, g(u)) is used, formally defined as follows (Cortes and Vapnik, 1995):
$$L_{\varepsilon}(v, g(u)) = \begin{cases} 0, & |v - g(u)| \le \varepsilon \\ |v - g(u)| - \varepsilon, & \text{otherwise} \end{cases} \qquad \varepsilon \ge 0 \tag{7}$$

Here, $g(u) = w \cdot u + b$ defines the regression function, where $w \in Z$, $b \in \mathbb{R}$ is the bias, and $w \cdot u$ denotes the dot product of the vectors $w$ and $u$. The objective is to minimize the squared norm $\|w\|^{2}$, which ensures flatness while controlling model complexity. The regression problem can therefore be expressed as the following convex optimization problem:
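$$\min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2}\|w\|^{2} + c \sum_{i=1}^{n} \left(\xi_i + \xi_i^{*}\right) \tag{8}$$

Subject to:

$$\begin{cases} v_i - (w \cdot u_i + b) \le \varepsilon + \xi_i \\ (w \cdot u_i + b) - v_i \le \varepsilon + \xi_i^{*} \\ \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, 2, \ldots, n \end{cases} \tag{9}$$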
In this formulation, the slack variables $\xi_i$ and $\xi_i^{*}$ quantify the extent to which training samples fall outside the ε-insensitive zone. Eq. (8) is the primal objective function, and Eq. (9) gives its constraints. A Lagrange function is then formed from the primal objective function by introducing a dual set of variables, $\alpha_i$ and $\alpha_i^{*}$, for the associated constraints. In the following phase, the kernel function is used to incorporate non-linearity into the model.
Four kernel function options are available: radial basis function (RBF), polynomial, linear, and sigmoid. Of these, RBF is extensively used in various applications. It takes the following form (Awad et al., 2015):

$$K(u_i, u_j) = \exp\left(-\gamma \left\| u_i - u_j \right\|^{2}\right) \tag{10}$$

The parameter γ governs the smoothness of the decision boundary in the feature space. Finally, the optimal regression function is obtained as follows:
$$g(u) = \sum_{i=1}^{n} \left(\alpha_i - \alpha_i^{*}\right) K(u_i, u) + b \tag{11}$$

where $0 \le \alpha_i, \alpha_i^{*} \le c$, $i = 1, 2, \ldots, n$; $\alpha_i$ and $\alpha_i^{*}$ are the Lagrange multipliers, and $K(u_i, u)$ is the kernel function. Fig. 5 shows the structure of the SVR model used in this investigation.
In this research context, support vector regression has been utilized to estimate the parameters which are associated with reservoir water quality.
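The three building blocks of the SVR formulation, namely the ε-insensitive loss, the RBF kernel, and the kernel expansion of the regression function, can be written out as a minimal sketch. It evaluates the formulas only and does not solve the optimization problem; the support vectors, dual coefficients (standing for the differences of the Lagrange multipliers), bias, and γ below are hypothetical values rather than fitted parameters.

```python
import math

def eps_insensitive_loss(v, g_u, eps):
    """Vapnik's epsilon-insensitive loss: zero inside the tube, linear outside."""
    return max(0.0, abs(v - g_u) - eps)

def rbf_kernel(u_i, u_j, gamma):
    """K(u_i, u_j) = exp(-gamma * ||u_i - u_j||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u_i, u_j))
    return math.exp(-gamma * sq_dist)

def svr_predict(u, support_vectors, dual_coefs, bias, gamma):
    """g(u) = sum_i coef_i * K(u_i, u) + b, with coef_i = alpha_i - alpha_i*."""
    return bias + sum(
        coef * rbf_kernel(sv, u, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    )

# Hypothetical support vectors (scaled depth, scaled temperature) and coefficients.
support_vectors = [(0.1, 0.9), (0.5, 0.5), (0.9, 0.2)]
dual_coefs = [1.2, -0.4, -0.8]
bias = 5.0
gamma = 2.0

estimate = svr_predict((0.3, 0.7), support_vectors, dual_coefs, bias, gamma)
print(f"g(u) = {estimate:.3f}")
print(eps_insensitive_loss(5.0, 5.05, eps=0.1))  # inside the tube
print(eps_insensitive_loss(5.0, 5.50, eps=0.1))  # outside the tube
```

A larger γ makes each kernel term decay faster with distance, so a prediction far from every support vector approaches the bias alone.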
2.4. Cross-validation approach for ML algorithms
Cross-validation is a model assessment technique that provides valuable insight into a learner's predictive capability on new, unseen data; it indicates how well the model will generalize (Sakizadeh, 2015). The underlying idea is to refrain from using the complete dataset during training. Before training begins, a portion of the data is set aside. Once training is complete, the held-out data are used to assess the ability of the trained models. This is the fundamental concept behind a comprehensive class of model evaluation techniques known as cross-validation.
The division of data into training and validation datasets can significantly influence the outcomes of the models. Various approaches for implementing cross-validation have been suggested in previous studies; nevertheless, they all share the same essence (Berrar, 2019). Of the available techniques, the hold-out method was chosen for this study because of its simplicity. Fig. 6 shows the cross-validation approach used in the current study.
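A hold-out split of the kind described above can be sketched as follows. The 30% test fraction and the fixed seed are assumptions made for this illustration; this section does not state the split ratio used in the study.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=42):
    """Shuffle record indices once and set a fraction aside for testing."""
    rng = random.Random(seed)
    indices = list(range(len(records)))
    rng.shuffle(indices)
    n_test = int(round(len(records) * test_fraction))
    test_idx = set(indices[:n_test])
    train = [r for i, r in enumerate(records) if i not in test_idx]
    test = [r for i, r in enumerate(records) if i in test_idx]
    return train, test

# Hypothetical (depth, DO) records for illustration only.
records = [(d, 8.0 - 0.15 * d) for d in range(0, 50, 5)]

train, test = holdout_split(records)
print(len(train), "training records;", len(test), "held-out records")
```

Because the outcome depends on which records land in each subset, fixing the seed keeps model comparisons on an identical split.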
2.5. Bayesian maximum entropy-based fusion (BMEF) model
A BME-based fusion model is implemented in this study by taking advantage of the features of the geostatistical approach known as Bayesian Maximum Entropy and integrating the results of the multiple estimation methods. The BMEF is a geostatistics-based method that can combine variables or regionalize them at numerous locations where data are absent (He and Kolovos, 2018). The method balances two conditions: the first is entropy maximization, which incorporates prior knowledge and experience regarding the spatial variation of the estimated variables; the second is the maximization of a Bayesian function, which yields a posterior probability with the least ambiguity (Christakos, 1990, 2000).
Fig. 5. Architecture of the SVR model in the current study.
In the current research, a vector of first to nth forecasts represents the water quality characteristics predicted by the three distinct models: MLP, RFR, and SVR. The fusion model takes the three models' less precise estimates of the water quality variables into account. Because it makes use of each individual model, the developed model can estimate the water quality variables with greater confidence and accuracy. Fig. 7 provides a schematic of the fusion steps using the developed BMEF model. The BMEF stages are described in more detail below (Ghazipour and Mahjouri, 2022).
1. Introducing the time-series data of the predictors into the n standalone models, which have been calibrated and validated.
2. Using each of the n models to derive n prior estimates of the water quality variables.
3. Forming a vector of the water quality variable estimates calculated by the individual models together with the related observation. The vector is written (x1j, x2j, …, xnj, zj), where x1j, x2j, …, xnj are the first to nth machine learning models' estimates of the water quality variable at depth j, and zj is the measured water quality variable at that depth.
4. Computing the distance between every estimate and the associated observation. The more closely a model's prediction fits the matching observation, the greater the model's proportional weight in the fusion process.
5. Choosing soft and hard data based on the ML results and the observation data. The primary benefit of the BMEF, which distinguishes it from other data fusion models, is that it lets the user represent the uncertainty in the problem (arising from the developed machine learning models) as a probabilistic or fuzzy quantity rather than a precise number. By separating the data into soft and hard data, the uncertainties of the current data are taken into account when estimating the water quality metrics. Hard data are regarded as precise measurements that contain negligible error. Soft data can be expressed as interval values or probability functions and often include estimation and observational errors. Therefore, in the current paper, the water quality metrics produced by each machine learning model are regarded as “soft data” since they are less precise. Their characteristics are expressed as histograms or probability mass functions calculated from the outputs of the machine learning algorithms.
6. Computing the experimental covariance values. The BMEF analysis establishes the link between the distances separating the data points and their significance. Here, the data points are the estimated values of the different water quality variables, and the distance between data points is the discrepancy between the values estimated by the various models and the corresponding measured values. A statistical correlation function called the covariance expresses this significance. Consequently, experimental variogram points can be created that show the relationship between distances and covariance values.
7. Generating suitable variograms that fit the experimental covariance values. The experimental variogram points are fitted with a chosen mathematical function called a theoretical variogram. In the present research, the objective is to fit a spatial covariance model to the observed locations using factual or quantitative data. The resulting variogram is used to compute the probability distribution functions (PDFs) of the water quality variables through posterior analysis.
8. Using the fitted variogram to estimate the values at unknown points. The water quality variables, zj, are estimated using the BMEF technique.
Fig. 6. Cross-validation procedure during the training of ML models.
Fig. 7. A schematic representation of fusion steps using the BMEF model (Christakos, 2000; Ghazipour and Mahjouri, 2022).
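The weighting idea in step 4 can be illustrated with a deliberately simplified sketch. It is not the BME posterior computation: it omits the soft-data probability functions, the covariance and variogram modeling, and the Bayesian update, and it simply weights each model by the inverse of its mean absolute distance to the observations. The function names and all numbers are hypothetical.

```python
def inverse_error_weights(model_estimates, observations, floor=1e-9):
    """Weight each model by the inverse of its mean absolute distance to the data."""
    inverse = {}
    for name, estimates in model_estimates.items():
        distance = sum(abs(e - z) for e, z in zip(estimates, observations)) / len(observations)
        inverse[name] = 1.0 / max(distance, floor)  # floor avoids division by zero
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}

def fuse(point_estimates, weights):
    """Weighted combination of the individual model estimates at one point."""
    return sum(weights[name] * value for name, value in point_estimates.items())

# Hypothetical DO observations (z_j) and model estimates (x_1j, x_2j, x_3j).
observations = [7.0, 6.0, 4.0]
model_estimates = {
    "MLP": [7.1, 6.1, 4.1],
    "RFR": [7.4, 5.6, 4.4],
    "SVR": [6.8, 6.2, 3.8],
}

weights = inverse_error_weights(model_estimates, observations)
fused = fuse({"MLP": 5.2, "RFR": 5.6, "SVR": 5.0}, weights)
print({name: round(w, 3) for name, w in weights.items()})
print(f"fused estimate: {fused:.3f}")
```

In this toy case the model closest to the observations receives the largest weight, which mirrors the qualitative behavior described in step 4 without reproducing the probabilistic treatment of steps 5 to 8.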
2.6. Performance evaluation
In the current study, the precision of the values estimated by the ML models was evaluated using the Nash-Sutcliffe efficiency (NSE), R-Squared, MAE, and RMSE. NSE was used because it is a common index of agreement with recorded data: it assesses the degree of concordance between the actual and simulated data, specifically in terms of their alignment with the 45-degree line (Ritter and Munoz-Carpena, 2013; Faiz et al., 2018). The R-Squared value quantifies the extent of covariance between the estimated and actual data, elucidating the degree of their agreement. MAE, on the other hand, is the average absolute discrepancy between corresponding observed and estimated values. The RMSE measures the typical magnitude of the model error, expressed in the units of the variable. NSE and R-Squared indicate better performance as their values approach 1, while the best result for both MAE and RMSE is a value of 0. These statistical indices are defined as follows.
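$$\mathrm{NSE} = 1 - \frac{\sum_{i=1}^{n} (T_i - P_i)^{2}}{\sum_{i=1}^{n} (T_i - \bar{T})^{2}} \tag{12}$$

$$R^{2} = \frac{\left[\sum_{i=1}^{n} (T_i - \bar{T})(P_i - \bar{P})\right]^{2}}{\sum_{i=1}^{n} (T_i - \bar{T})^{2} \sum_{i=1}^{n} (P_i - \bar{P})^{2}} \tag{13}$$

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| T_i - P_i \right| \tag{14}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (T_i - P_i)^{2}} \tag{15}$$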
In these formulas, $T_i$ is the target value, $P_i$ is the predicted value, $\bar{T}$ is the average target value, $\bar{P}$ is the average predicted value, and $n$ is the number of data points.
These statistical indices are determined separately for the training and testing data of each model.
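The four indices can be computed directly from paired target and predicted values. The sketch below follows the standard definitions of NSE, R-Squared, MAE, and RMSE; the function names and the toy values are illustrative and are not results from the study.

```python
import math

def nse(target, predicted):
    """Nash-Sutcliffe efficiency: 1 minus error variance over target variance."""
    mean_t = sum(target) / len(target)
    num = sum((t - p) ** 2 for t, p in zip(target, predicted))
    den = sum((t - mean_t) ** 2 for t in target)
    return 1.0 - num / den

def r_squared(target, predicted):
    """Squared Pearson correlation between target and predicted values."""
    mean_t = sum(target) / len(target)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(target, predicted))
    var_t = sum((t - mean_t) ** 2 for t in target)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_t * var_p)

def mae(target, predicted):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)

def rmse(target, predicted):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target))

# Hypothetical target and predicted values for illustration only.
target = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

for name, fn in [("NSE", nse), ("R2", r_squared), ("MAE", mae), ("RMSE", rmse)]:
    print(f"{name}: {fn(target, predicted):.4f}")
```

Note that NSE penalizes bias while R-Squared does not: a prediction that is perfectly correlated with the target but shifted by a constant still scores an R-Squared of 1 and an NSE below 1.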
3. Case study
3.1. The context of the study
Oman is a semi-arid country with minimal supplies of permanent surface water. Wadi Dayqah is one of the few wadis (river beds) in Oman that has consistently running water (Prisk et al., 2009). Around the dam there are mountains of many forms, a variety of wadis, and fertile fields, making the area a desirable spot for tourism, fishing, and agriculture. Therefore, Wadi Dayqah Dam not only has potential for hydropower but also has great economic importance. Between the summer and winter seasons, climate and precipitation shift between the mountains and the coastal areas or the upstream reaches of the wadi. The coastal portions of the dam area experience high summertime temperatures and humidity, whereas the mountains experience lower summertime humidity. In addition, precipitation occurs during the winter, and some years are significantly wetter, causing severe floods because there are other water bodies near the dam. The region's year-round sunlight and generally arid climate result in relatively little water flowing through the wadis.
Long water retention times in this dam and thermal stratification contribute to a) a significant reduction in water quality, b) a possible risk to human health, and c) higher treatment requirements in the Wadi Dayqah reservoir. A variation in temperature at various depths in the reservoir that affects the density of the water is referred to as thermal stratification. Furthermore, stratification separates water into two layers in a reservoir, the bottom layer of which is not allowed to receive at- mospheric oxygen. Fig. 8 shows the location of the dam and sampling stations to obtain water quality variables.
3.2. Data sources
In the current study, the in-situ data were collected with a fast optical DO sensor, namely AAQ-RINKO, at the 34 sampling stations situated across Wadi Dayqah Dam, shown in Fig. 8. A CTD (AAQ-RINKO) is a primary instrument used to determine the essential physical properties of seawater. To better understand how the seas affect life, it provides scientists with a detailed and accurate mapping of the distribution of, and variations in, the water's temperature, salinity, and density. Notably, the data for the study area were acquired on January 15, 2023, at 34 sampling points at various depths from the surface to 35 m. A summary of the data obtained from the CTD (AAQ-RINKO) is presented in Table 1. Because most of the collected data are unevenly distributed, the assumption that the data follow a normal distribution is invalid. Thus, we use non-parametric tests to assess the data (Siegel, 1957). Such statistical tests are reliable regardless of the distribution of the data, so no assumptions about how the data are distributed are needed.
3.2.1. Uncertainty analysis
In the present research, the Monte Carlo analysis and resampling approach was applied to conduct an uncertainty evaluation of the results of both the singular prediction models and the data fusion model, BMEF.
Fig. 8. Wadi Dayqah Dam’s location and sampling stations.
Efron (1979) introduced the bootstrap as a resampling method for computing the statistical properties of variables. The theory underlying the approach is that, in the absence of knowledge about a variable's true distribution, the values obtained from a sample are the best available estimate of its fundamental characteristics. The bootstrap method therefore approximates the unknown population distribution by the known empirical distribution derived from the actual data. This approach is useful for uncertainty analysis because it is comparatively straightforward to use compared with traditional analytical approximations (Dai et al., 2007; Brus et al., 2008). According to Hall (1994), the key stages of the bootstrap procedure are (1) random sampling with replacement from the original data, with each observation drawn independently to ensure unbiasedness, and (2) a Monte Carlo simulation that approximates the resampling process. Fig S.1 illustrates a schematic representation of the bootstrap resampling technique.
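The two stages described above (drawing with replacement, then repeating the draw many times as a Monte Carlo approximation) can be sketched as follows. This is a generic illustration of the bootstrap, not the code used in the study; the function names, the number of resamples, and the sample values are hypothetical.

```python
import random
import statistics

def bootstrap_statistic(data, statistic, n_resamples=1000, seed=0):
    """Monte Carlo approximation of the bootstrap distribution of a statistic."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Stage 1: draw n values from the original data WITH replacement.
        resample = rng.choices(data, k=n)
        # Stage 2: evaluate the statistic on this resample.
        estimates.append(statistic(resample))
    return estimates

def percentile_interval(estimates, alpha=0.05):
    """Symmetric percentile interval taken from the sorted bootstrap estimates."""
    ordered = sorted(estimates)
    k = int(len(ordered) * alpha / 2)
    return ordered[k], ordered[len(ordered) - 1 - k]

# Hypothetical dissolved-oxygen readings, for illustration only.
sample = [6.9, 7.1, 7.0, 6.5, 7.3]
means = bootstrap_statistic(sample, statistics.mean, n_resamples=200)
low, high = percentile_interval(means)
```

The spread of `means` is what expresses the uncertainty: a narrow interval means the statistic is stable under resampling of the available data.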
4. Results and discussion
In the present research, three artificial intelligence techniques, together with their combination, known as BMEF, were developed to improve on the results of the individual machine learning models.
These models were developed to estimate WQVs at specific points, namely the sampling stations, defined by latitude and longitude, and at various depths of the dam. The total number of records collected is 5185, of which 75% is used to train each machine learning model and the remaining 25% is used to test the developed models. This split was chosen because the ML algorithms estimated WQPs more accurately with it than with other splits.
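As a small worked example of the 75%/25% partition of the 5185 records, the sketch below shuffles record indices and cuts them at the 75% mark. The text does not state how records were assigned to each set, so the random shuffle, the seed, and the function name are assumptions made for illustration.

```python
import random

def train_test_split_indices(n_records, train_fraction=0.75, seed=42):
    """Shuffle record indices and split them into training and testing sets."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # assumed: random assignment
    n_train = int(train_fraction * n_records)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = train_test_split_indices(5185)
```

Because 75% of 5185 is not an integer, the split is 3888 training records and 1297 testing records under this rounding rule.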
4.1. Model development
4.1.1. The inputs and outputs of machine learning algorithms
Since the main goal of the current study is to establish a fusion model based on Bayesian Maximum Entropy, which integrates the outputs of the standalone ML models to enhance their accuracy, the initial inputs for the individual models are latitude (lat), longitude (long), depth (Depth), and those WQPs having a high correlation with Chl-a and DO at the determined locations, namely the sampling points. The output (target) of the dataset can be any of the WQPs; in the present study, chlorophyll-a and dissolved oxygen were considered as output (target) variables. Before the inputs for the ML models were finalized, a sensitivity analysis of the variables was performed to determine their impact on the models' outputs. Initially, the models for predicting WQPs included latitude, longitude, and depth. However, the findings demonstrated that performance improved when depth was excluded from the models' inputs. As a result, latitude and longitude were chosen as the main inputs for the estimation of WQPs. In addition, adding the WQPs that are significantly correlated with Chl-a and DO can increase the accuracy of the individual machine learning models and that of the BMEF model. With Chl-a as the output variable, water temperature, pH, and dissolved oxygen have the potential to enhance the accuracy of the ML algorithms. For dissolved oxygen, the water quality variables that can enhance the algorithms' precision are temperature and chlorophyll-a. The results of the individual machine learning models and their inputs are given in Table 2.
According to the results obtained from the individual models and the BMEF model, in nearly all scenarios the accuracy (the percentage of correctly forecast WQPs) increases when the added water quality variables are highly correlated with the target water quality variable. This can be explained by the fact that adding depth, which does not correlate with the target variables, decreases the accuracy of the ML algorithms, whereas adding the fourth predictor, which is highly correlated with the output variables, increases their accuracy. In addition, the BMEF results, which are obtained by fusing the outputs of the individual models in order to improve on them, significantly outperform those of the individual models. It is worth mentioning that, because of the nature of the data, training the ML models for different water quality variables does not require the same amount of training time.

Table 1
Statistical summary of obtained data from Wadi Dayqah Dam.

| Parameters | Mean | Standard deviation | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|
| Temperature | 23.19 | 0.11 | 22.54 | 23.17 | 23.18 | 23.20 | 23.67 |
| Chlorophyll-a | 4.00 | 6.93 | 0 | 2.45 | 3.57 | 3.92 | 195.65 |
| EC | 565.83 | 19.2 | 537.6 | 557.2 | 558.4 | 560.3 | 655.3 |
| ORP | 69.47 | 71.96 | −224.56 | 42.10 | 85.24 | 115.75 | 186.12 |
| pH | 7.42 | 0.33 | 4.960 | 6.14 | 6.98 | 7.02 | 9.767 |
| DO | 6.45 | 2.01 | 0 | 6.94 | 7.06 | 7.17 | 10.70 |

Table 2
Percentage of correctly estimated WQPs using each model individually and the BMEF model.

| Model | Predictor | WQPs | Training period | Testing period |
|---|---|---|---|---|
| MLP | Lat + Long | Chlorophyll-a | 0.86 | 0.87 |
| MLP | Lat + Long | Dissolved Oxygen | 0.89 | 0.9 |
| MLP | Lat + Long + Depth | Chlorophyll-a | 0.83 | 0.84 |
| MLP | Lat + Long + Depth | Dissolved Oxygen | 0.89 | 0.91 |
| MLP | Lat + Long + Depth + WQPs a | Chlorophyll-a | 0.82 | 0.82 |
| MLP | Lat + Long + Depth + WQPs a | Dissolved Oxygen | 0.91 | 0.92 |
| RFR | Lat + Long | Chlorophyll-a | 0.81 | 0.81 |
| RFR | Lat + Long | Dissolved Oxygen | 0.84 | 0.85 |
| RFR | Lat + Long + Depth | Chlorophyll-a | 0.81 | 0.81 |
| RFR | Lat + Long + Depth | Dissolved Oxygen | 0.84 | 0.85 |
| RFR | Lat + Long + Depth + WQPs a | Chlorophyll-a | 0.83 | 0.84 |
| RFR | Lat + Long + Depth + WQPs a | Dissolved Oxygen | 0.87 | 0.88 |
| SVR | Lat + Long | Chlorophyll-a | 0.74 | 0.72 |
| SVR | Lat + Long | Dissolved Oxygen | 0.79 | 0.80 |
| SVR | Lat + Long + Depth | Chlorophyll-a | 0.71 | 0.74 |
| SVR | Lat + Long + Depth | Dissolved Oxygen | 0.74 | 0.75 |
| SVR | Lat + Long + Depth + WQPs a | Chlorophyll-a | 0.70 | 0.72 |
| SVR | Lat + Long + Depth + WQPs a | Dissolved Oxygen | 0.72 | 0.75 |
| BMEF | Lat + Long | Chlorophyll-a | 0.95 | 0.95 |
| BMEF | Lat + Long | Dissolved Oxygen | 0.98 | 0.99 |
| BMEF | Lat + Long + Depth | Chlorophyll-a | 0.97 | 0.97 |
| BMEF | Lat + Long + Depth | Dissolved Oxygen | 0.99 | 0.98 |
| BMEF | Lat + Long + Depth + WQPs a | Chlorophyll-a | 0.93 | 0.94 |
| BMEF | Lat + Long + Depth + WQPs a | Dissolved Oxygen | 0.99 | 0.99 |

a Those water quality variables which have a high correlation with output variables.
4.2. Tuning machine learning parameters used in this study
Tuning is the process of determining the best value of each parameter of a training algorithm. In supervised learning, “best” can refer to a variety of performance indicators (for example, the error rate or the AUC) as well as to runtime, which under some circumstances can be substantially affected by the hyper-parameters. The performance gain that can be obtained by tuning one hyper-parameter or all hyper-parameters, compared with the default settings, is referred to as the “tunability” of that hyper-parameter or of the algorithm, respectively (Probst et al., 2019).
4.2.1. Hyper-parameter optimization (HPO) for MLP
The MLPRegressor algorithm in Keras' Neural Network (NN) module can carry out regression tasks with a multilayer perceptron. Because there is no universally applicable guideline for choosing the number of hidden layers and their neurons, the trial-and-error method was used to configure them, an approach supported by several studies (Zhang et al., 2023). The effectiveness of a network is substantially affected by the number of hidden-layer neurons: when the number is small, the network may not achieve the required level of accuracy; conversely, when it is large, training time may increase and the model may over-fit the data. In the resulting MLP, an input layer representing latitude, longitude, depth, and WQPs is followed by two hidden layers, the first containing 50 neurons and the second containing 25 neurons. “Adam” was chosen as the optimization technique to minimize the MAE, which was selected as the loss function for the developed model. Furthermore, the Rectified Linear Unit (ReLU) was used as the activation function in both the hidden and output layers. Table 3 presents a summary of the tuned hyper-parameters of the MLP model. Fig. 9 shows the loss function plots for the MLP model in predicting chlorophyll-a in the study area.
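To make the described architecture concrete, the sketch below builds a network with two hidden layers of 50 and 25 neurons, ReLU in the hidden and output layers, and an MAE loss, using only the Python standard library. It is a forward pass with randomly initialized weights, not a trained model: Adam optimization is not implemented here, and the four-input layout (latitude, longitude, depth, one WQP), the weight initialization, and all function names are assumptions for illustration.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def init_layer(n_in, n_out, rng):
    """Random weights (assumed He-style scaling) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_mlp(n_inputs, hidden=(50, 25), seed=0):
    """Input layer -> 50 neurons -> 25 neurons -> 1 output."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Apply each dense layer followed by ReLU, including the output layer."""
    for weights, biases in layers:
        x = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x[0]

def count_parameters(layers):
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

def mae_loss(targets, predictions):
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)

mlp = build_mlp(n_inputs=4)
prediction = forward(mlp, [0.2, 0.5, 0.1, 0.7])  # hypothetical scaled inputs
```

With four inputs this network has 4·50 + 50 + 50·25 + 25 + 25·1 + 1 = 1551 trainable parameters. A ReLU output layer also constrains predictions to be non-negative, which is compatible with concentrations such as Chl-a and DO.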
4.2.2. Hyper-parameter optimization (HPO) for RFR
To use the random forest (RF) model, the user needs to specify several hyper-parameters, such as the number of observations sampled for each tree, the splitting criterion, the minimum number of samples required for a node, and the number of trees. In practice, however, users may be unsure which values of these parameters could improve the model's performance beyond the default values.
Therefore, the tuned hyper-parameters of the random forest algorithm used in this study are displayed in Table 4 for a better understanding of their impact on the model's performance.
4.2.3. Hyper-parameter optimization (HPO) for SVR
Improving the forecasting precision of the SVR model is a major challenge that involves tuning hyper-parameters. To address this challenge, a trial-and-error method was employed to optimize five parameters of the SVR model: the kernel function type, which could be selected from three options (polynomial, linear, and radial basis functions), sigma, C, epsilon, and the degree of the polynomial function, which was considered only if the polynomial kernel was selected. Fig. 5 illustrates the process used for fine-tuning the parameter values for the SVR model through trial and error. Table 5 provides a summary of the best values for SVR's parameters.
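The structure of this search space, in which the polynomial degree is explored only when the polynomial kernel is selected, can be sketched as a conditional grid. This is an illustrative enumeration, not the authors' procedure: the candidate values are hypothetical, and `evaluate` stands for any user-supplied function that trains an SVR with a given configuration and returns its validation error.

```python
from itertools import product

def svr_candidates(kernels, sigmas, box_constraints, epsilons, degrees):
    """Yield SVR configurations; 'degree' appears only for the polynomial kernel."""
    for kernel, sigma, c, eps in product(kernels, sigmas, box_constraints, epsilons):
        base = {"kernel": kernel, "sigma": sigma, "C": c, "epsilon": eps}
        if kernel == "polynomial":
            for degree in degrees:
                yield {**base, "degree": degree}
        else:
            yield base

def trial_and_error(candidates, evaluate):
    """Return the configuration with the lowest error reported by evaluate()."""
    return min(candidates, key=evaluate)

# Hypothetical candidate values, for illustration only.
candidates = list(svr_candidates(
    kernels=["linear", "polynomial", "rbf"],
    sigmas=[1.0, 10.0],
    box_constraints=[1.0, 1000.0],
    epsilons=[0.0, 0.1],
    degrees=[2, 3],
))
```

Making the degree conditional keeps the grid from growing needlessly: here the linear and RBF kernels contribute 8 configurations each and the polynomial kernel 16, for 32 in total rather than 48.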
4.2.4. The results of water quality estimation using the BMEF
This section presents the outcomes of water quality estimation using the BMEF. Both the actual measurements of water quality and the predictions produced by the different models are used to compute the empirical covariance function values. To estimate the unknown water quality indices, MATLAB “cftool” is used to fit a suitable theoretical covariance function to the empirical points. The fitted covariance functions are then used to interpolate the unknown water quality indices. To account for the uncertainty associated with the less precisely predicted water quality indices, the estimated outputs of models exhibiting a greater degree of uncertainty are treated as “soft” data. Empirical histograms provide one way of defining such soft data. In general, soft data take four main forms (Fig S.2.). In the current study, type 1, the probability distribution, was chosen, with the soft data defined by the relative frequency of occurrence of the actual data.
The individual ML algorithms selected for their strong influence on the outcomes of the fusion model are listed in Table 6. Additionally, Table 7 shows the performance values for chlorophyll-a and dissolved oxygen obtained with the BMEF model. According to the findings of the BMEF model, using depth as a predictor reduces the accuracy of water quality estimation because of its low correlation with the target variables. Moreover, similar or improved outcomes are usually achieved by combining the outputs of several models that use a reduced set of predictors. Section 4.3 discusses the outcomes of each individual model and of the BMEF model in detail. Because there were only three separate models in total, all of the ML algorithms were employed in the fusion process.
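The empirical covariance values mentioned in this section can be illustrated with a generic binned estimator: pairs of points are grouped by separation distance, and the covariance of their values is averaged within each distance bin. This is a minimal sketch under the assumption of an isotropic field and equal-width lag bins; it is not the BME software used in the study, and the function name and arguments are hypothetical. A theoretical covariance model would then be fitted to the returned points.

```python
import math

def empirical_covariance(coords, values, lag_width, n_lags):
    """Return (lag_center, covariance, pair_count) for each distance bin."""
    n = len(values)
    mean = sum(values) / n
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(n):
        for j in range(i, n):  # j == i contributes the zero-distance pairs
            h = math.dist(coords[i], coords[j])
            k = int(h // lag_width)
            if k < n_lags:
                sums[k] += (values[i] - mean) * (values[j] - mean)
                counts[k] += 1
    return [
        ((k + 0.5) * lag_width, sums[k] / counts[k] if counts[k] else None, counts[k])
        for k in range(n_lags)
    ]

# Three hypothetical points on a line, for illustration only.
cov_points = empirical_covariance(
    coords=[(0.0,), (1.0,), (2.0,)], values=[1.0, 2.0, 3.0],
    lag_width=1.0, n_lags=3,
)
```

Bins with no pairs return `None` rather than zero, so that they can be excluded when a theoretical model is fitted.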
Table 3
Optimal value of MLP’s parameters.

| Parameters | Optimal value |
|---|---|
| Number of Layers | 3 |
| Number of Nodes | 20 |
| Hidden Activation Function | ReLU |
| Recurrent Activation Function | Sigmoid |
| Optimization Algorithm | Adam |
| Dropout | 0.1 |
| Recurrent Dropout | 0.05 |
| Loss Function | MAE |
| Output Activation Function | ReLU |
Fig. 9. Charts showing the loss function of the MLP model for Chl-a prediction.
4.3. Comparison of model performance
4.3.1. General comparison of individual machine learning models and the BMEF model
To develop the machine learning algorithms, the dataset was divided into training and testing sets: 75% of the dataset was used for training, and the remaining 25% was used for testing. The estimates of chlorophyll-a and dissolved oxygen in the training/calibration and testing periods are shown in Table 7. It is worth mentioning that five water quality variables were used to develop the machine learning models and the BMEF model, and two of them, namely chlorophyll-a and dissolved oxygen, were analyzed in depth with respect to the uncertainty and sensitivity of the input parameters.
Furthermore, the results presented in Table 7 are based on three input variables, namely latitude, longitude, and depth, together with the best combination of predictors for estimating the water quality variables.
Table 7 reports the outputs of the individual ML algorithms and of the fusion model. In terms of the NSE index, BMEF estimates WQPs with the highest accuracy of all the models, since it draws on the best performance of each individual model. For example, the NSE for chlorophyll-a in the testing/validation period is 0.82, 0.68, 0.78, and 0.86 for MLP, RFR, SVR, and BMEF, respectively. In other words, by this index BMEF outperformed MLP, RFR, and SVR. On the other hand, since the SVR's parameters were tuned by trial and error, this model failed to estimate the water quality variables accurately, which was true for virtually all evaluation indices in both the training and testing periods. All in all, BMEF shows a high level of agreement between observed and simulated values, whereas the SVR model performs weakly in estimating water quality variables at different locations and depths. The evaluation results are shown graphically as spider plots below.
Fig. 10 (a to d) shows spider plots of the R-Squared, NSE, RMSE, and MAE for the developed models, which were used to assess and validate the models' performance. For nearly all WQPs in the training and testing periods, the BMEF produced the best results, with the lowest MAE and RMSE and the highest NSE and R-Squared. In contrast, the SVR model was the least capable of predicting WQPs, with the greatest MAE and RMSE and the lowest NSE and R-Squared. The MLP and RFR models both perform reasonably well in estimating water quality variables, with accuracy between that of the BMEF and SVR models. It should be noted that the results for chlorophyll-a, comprising the statistical indices and the 1:1 line, are presented in detail for the best-performing model, namely the BMEF model.
4.3.2. Results obtained for dissolved oxygen and chlorophyll-a
Fig. 11 presents a visual comparison of the accuracy indices during the training and testing phases for the individual machine learning models and the BMEF-based fusion model. The performance of each model can be assessed and categorized according to different classifications. Throughout the testing phase, the NSE values fell within the range of 0.65–0.96, so the models were classified as appropriate for the simulation of chlorophyll-a. Regarding R-Squared, all models were assessed as demonstrating strong performance throughout both phases, with index values ranging from 0.65 to 0.99, where a higher value indicates better model performance. As a result, the simulated dataset closely resembles the observed data.

Table 4
Summary of described optimal parameters in the RF model.

| Parameters | max_depth | criterion | min_size | sample_size | n_trees | n_features |
|---|---|---|---|---|---|---|
| Optimal values | 15 | MSE | 7 | 10 | 1000 | 2 |

Table 5
Summary of described optimal parameters in the SVR model.

| Parameters | Kernel_ind | Kernel_function | Kernel_scale | Box_constraint | Epsilon |
|---|---|---|---|---|---|
| Optimal values | 1 | RBF | 4980 | 1000 | 0 |

Table 6
Selected individual models employed in the BMEF model.

| Predictor | Selected standalone algorithms |
|---|---|
| Lat + Long | MLP, RFR, SVR |
| Lat + Long + Depth | MLP, RFR, SVR |
| Lat + Long + Depth + WQPs∗ | MLP, RFR, SVR |

Table 7
Performance of MLP, RFR, SVR, and BMEF considering all latitude, longitude, and depth as input variables.

| Variable | Model | NSE (train) | NSE (test) | R2 (train) | R2 (test) | RMSE (train) | RMSE (test) | MAE (train) | MAE (test) |
|---|---|---|---|---|---|---|---|---|---|
| Temperature | MLP | 0.83 | 0.84 | 0.86 | 0.87 | 2.98 | 2.94 | 1.18 | 1.02 |
| Temperature | RFR | 0.80 | 0.81 | 0.88 | 0.91 | 3.12 | 3.00 | 0.96 | 0.81 |
| Temperature | SVR | 0.74 | 0.75 | 0.86 | 0.90 | 2.83 | 2.81 | 1.27 | 0.92 |
| Temperature | BMEF | 0.94 | 0.96 | 0.93 | 0.93 | 2.23 | 2.20 | 0.79 | 0.75 |
| Chlorophyll-a | MLP | 0.88 | 0.90 | 0.93 | 0.94 | 0.48 | 0.40 | 0.64 | 0.60 |
| Chlorophyll-a | RFR | 0.82 | 0.82 | 0.91 | 0.90 | 0.34 | 0.30 | 0.58 | 0.55 |
| Chlorophyll-a | SVR | 0.83 | 0.84 | 0.89 | 0.91 | 1.33 | 1.31 | 0.93 | 0.91 |
| Chlorophyll-a | BMEF | 0.90 | 0.91 | 0.96 | 0.98 | 0.3 | 0.3 | 0.29 | 0.25 |
| Electrical Conductivity | MLP | 0.84 | 0.85 | 0.83 | 0.86 | 1.18 | 1.03 | 0.98 | 0.95 |
| Electrical Conductivity | RFR | 0.85 | 0.88 | 0.84 | 0.88 | 0.93 | 0.93 | 0.94 | 0.91 |
| Electrical Conductivity | SVR | 0.64 | 0.69 | 0.6 | 0.65 | 4.18 | 3.91 | 2.84 | 2.51 |
| Electrical Conductivity | BMEF | 0.89 | 0.91 | 0.87 | 0.91 | 0.64 | 0.66 | 0.54 | 0.50 |
| Oxidation Reduction Potential | MLP | 0.9 | 0.92 | 0.9 | 0.92 | 0.23 | 0.21 | 0.13 | 0.15 |
| Oxidation Reduction Potential | RFR | 0.88 | 0.91 | 0.86 | 0.89 | 0.28 | 0.25 | 0.18 | 0.25 |
| Oxidation Reduction Potential | SVR | 0.61 | 0.68 | 0.66 | 0.70 | 1.4 | 1.29 | 0.73 | 0.75 |
| Oxidation Reduction Potential | BMEF | 0.94 | 0.93 | 0.95 | 0.97 | 0.19 | 0.19 | 0.11 | 0.13 |
| Dissolved Oxygen | MLP | 0.88 | 0.90 | 0.87 | 0.89 | 0.33 | 0.21 | 0.38 | 0.36 |
| Dissolved Oxygen | RFR | 0.9 | 0.92 | 0.89 | 0.92 | 0.48 | 0.46 | 0.35 | 0.35 |
| Dissolved Oxygen | SVR | 0.79 | 0.84 | 0.79 | 0.83 | 1.33 | 1.23 | 0.93 | 0.90 |
| Dissolved Oxygen | BMEF | 0.92 | 0.95 | 0.92 | 0.94 | 0.28 | 0.24 | 0.29 | 0.27 |
Fig. 10. Spider plot of the performances of the model in the estimation of (a) temperature; (b) oxidation-reduction potential; (c) dissolved oxygen; (d) electrical conductivity.
Fig. 11. The comparison between the accuracy of the machine learning models and BMEF for chlorophyll-a.
To assess the models' effectiveness in predicting chlorophyll-a, a Taylor diagram (Fig. 12) was employed. A Taylor diagram is a graphical summary of how closely one or more representations of a process match the observations, indicating which of the developed models is most realistic. The diagram shows that the fusion model, that is, the BMEF model, yielded the most favorable outcomes for predicting chlorophyll-a, with a correlation coefficient of 0.98 and a normalized standard deviation of 0.64. These findings indicate a robust correlation between the observed and predicted chlorophyll-a levels and suggest that, among the tested models, the BMEF model is the most accurate and realistic representation of the chlorophyll-a prediction process. In contrast, the SVR yielded the least favorable outcomes, with a correlation coefficient of 0.6 and a normalized standard deviation of 0.96. These values indicate a weaker correlation and higher variability between the observed and predicted chlorophyll-a levels. The suboptimal performance of the SVR model implies that it may not be a suitable choice for accurately predicting chlorophyll-a levels in reservoirs.
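The two quantities read from a Taylor diagram in this discussion, the correlation coefficient and the normalized standard deviation, can be computed as in the following sketch. It assumes the usual convention in which the standard deviation of the predictions is divided by that of the observations; the function name and the sample values are hypothetical.

```python
import math

def taylor_statistics(observed, predicted):
    """Return (correlation coefficient, normalized standard deviation)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted)) / n
    std_o, std_p = math.sqrt(var_o), math.sqrt(var_p)
    # Normalizing by std_o places the observations at 1.0 on the radial axis.
    return cov / (std_o * std_p), std_p / std_o

# Hypothetical values: predictions are exactly twice the observations.
corr, norm_std = taylor_statistics([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
```

In this toy case the correlation is 1 but the normalized standard deviation is 2, which shows why both statistics are needed: a model can track the pattern of the observations perfectly while misrepresenting their variability.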
In Fig. 13, the boxplot visualizes the recorded and forecasted dissolved oxygen values within the case study. The median values derived from both the MLP and BMEF models closely matched the observed dissolved oxygen values. These models differed in the lower quartile, the 25th percentile (Q25), as well as in the range of the data, encompassing the minimum and maximum values. The BMEF model produced a Q25 value that closely resembles that of the observed dissolved oxygen values, whereas the MLP model produced a data range close to that of the observed values. While the median values of the MLP, SVR, and BMEF models were close to the observed dissolved oxygen values, the BMEF prediction model displayed a particularly close alignment with the observed values during the testing period, surpassing the standalone machine learning algorithms.
Regarding the R-Squared, Fig. 14 shows that positive errors correspond to instances where the model underestimates the observations, while negative errors indicate overestimation. Furthermore, data points positioned above the 45-degree line indicate a lower significance level in the simulated data than in the actual data. In addition, a higher density of points close to the 45-degree line indicates better model performance.
4.4. Uncertainty analysis
In this section, the findings of an uncertainty analysis of the BMEF for two water quality variables, namely chlorophyll-a and dissolved oxygen, are presented and compared with the findings from the individual models. The estimates of the water quality variables are subjected to an uncertainty analysis using Monte Carlo analysis and bootstrap resampling. To this end, 100 samples of identical length are drawn from the initial calibration data. Each sample is then used to calibrate a different model, and the calibrated model is used to estimate the water quality variables in the verification step. Lastly, the resulting evaluation-index values are used to create a probability mass function for each estimated water quality variable. For a more thorough analysis, the probability mass functions of the machine learning models for estimating chlorophyll-a and dissolved oxygen are presented below for the case of four or more input variables, namely latitude, longitude, depth, and WQPs. Notably, the in-depth examination of chlorophyll-a estimation under different scenarios and inputs is presented for both SVR, the weakest machine learning model, and BMEF, the strongest model. As seen in Fig. 15, which shows the probability mass function of dissolved oxygen for the machine learning models, the BMEF model, and the field measurements considering three input variables, BMEF tends to estimate dissolved oxygen in a range of 6–9, which closely matches the observed values. Additionally, RFR and MLP perform satisfactorily in estimating dissolved oxygen at different depths. On the other hand, the SVR model, under different circumstances, does not estimate dissolved oxygen appropriately. All in all, in estimating dissolved oxygen at various depths of the reservoir, the models rank from highest to lowest performance as BMEF, MLP ≈ RFR, and SVR.
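The final step of this procedure, turning a set of estimated values into a probability mass function, can be sketched as a normalized histogram. This is a minimal illustration under the assumption of equal-width bins; the bin width, the function name, and the sample values are hypothetical and do not come from the study.

```python
import math
from collections import Counter

def probability_mass_function(values, bin_width):
    """Map each bin's lower edge to the fraction of values falling in that bin."""
    counts = Counter(math.floor(v / bin_width) for v in values)
    n = len(values)
    return {k * bin_width: c / n for k, c in sorted(counts.items())}

# Hypothetical dissolved-oxygen estimates, for illustration only.
pmf = probability_mass_function([6.1, 6.4, 7.2, 8.9], bin_width=1.0)
```

Comparing the probability mass function of a model's estimates with that of the field measurements, bin by bin, is what allows statements such as a model concentrating its estimates within a given range.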
Fig. 12. Taylor plot of the models’ performance in predicting chlorophyll-a.
Fig. 13. Box plot comparing the model’s abilities to predict dissolved oxygen.
Fig. 14. The best model of R-Squared (BMEF) for dissolved oxygen.
Fig. 15 shows the probability mass function of machine learning and the BMEF model. These results indicate that BMEF outperforms