
4.1 Performance of Base Models at Different Stations

4.1.2 Comparison of Base Models


value of u is relatively low, and that in turn reduces its contribution towards the ET.

The study on the effect of input meteorological variables provides a clear picture that can help prioritise the data collection process. The Rs should be collected first, as it is the key meteorological variable for ET0 estimation, followed by other complementary meteorological variables in Peninsular Malaysia. It is noteworthy that the complementary meteorological variables needed to further enhance the ET0 estimations differed across the study area. In other words, the optimum input combinations could differ from station to station. This finding is discussed in Section 4.1.3.


Figure 4.2: (a) MAE, (b) RMSE, (c) MAPE, (d) R² and (e) MBE of SVM Estimation at Different Stations with Different Input Combinations


Figure 4.3: (a) MAE, (b) RMSE, (c) MAPE, (d) R² and (e) MBE of ANFIS Estimation at Different Stations with Different Input Combinations


previously. Darker bands appeared in the heat map of MAE, RMSE, MAPE and R² periodically when the essential Rs was missing from the input combinations.

However, by comparing the performances of the models numerically, significant differences are discovered.

In comparison with the MLP, the SVM generally estimated less accurately, particularly for the cases with a higher number of input meteorological variables. This can be observed in Figure 4.2(c), where the MAPE of the SVM is as high as 25 %. This was driven by the high MAE and RMSE of the SVM-estimated ET0. For instance, the MAE and RMSE of the ET0 estimated at Station 48603 (Alor Setar) by the MLP using C1 as the input combination were 0.0277 mm/day and 0.0357 mm/day, respectively, whereas the corresponding values of the SVM were 0.0437 mm/day and 0.0538 mm/day.

Despite having higher estimation errors with C1, the performance of the SVM did not deteriorate much as the number of input meteorological variables decreased. When C44 was used as the input combination, the MLP registered an MAE and RMSE of 0.1383 mm/day and 0.1835 mm/day, a significant decline in performance compared with C1, which uses all meteorological variables as input (MAPE increased by 2.5835 %). By contrast, the MAE and RMSE of the SVM under the same condition were 0.1376 mm/day and 0.1835 mm/day. The resultant loss of accuracy (MAPE increased by 2.1954 %) was smaller than that of the MLP. In fact, the accuracy of the SVM was higher than that of the MLP in this case (i.e., C44). This suggests that the SVM has better generalisability and resilience towards the reduction of input meteorological variables. A similar phenomenon can also be observed from the lower variance of the R² values at different stations when ET0 was estimated using the SVM with different input combinations.

The performance of the SVM can be explained by examining the working mechanism of the model itself. The SVM transforms data into a feature space by using a kernel function (the RBF function in this study). Estimation by the SVM is accomplished by selecting a support vector structure that minimises the risk or loss function. In other words, the SVM performs regression analysis by viewing the feature space as a whole, in contrast to the point-by-point adjustment of the MLP. The high generalisability and robustness of the SVM allow it to be used in many cases, but at the same time sacrifice the possibility of achieving very high accuracy (which could also be a sign of overfitting).
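The kernel-based estimation described above can be illustrated with a minimal sketch. It assumes the standard RBF kernel and the epsilon-insensitive loss commonly used for support vector regression; the loss function, the `gamma` value, the support vectors and the coefficients below are hypothetical and are not taken from this study.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel expansion f(x) = bias + sum_i coef_i * k(sv_i, x)."""
    return bias + sum(c * rbf_kernel(sv, x, gamma)
                      for sv, c in zip(support_vectors, dual_coefs))

def eps_insensitive_loss(y_true, y_pred, epsilon):
    """Errors within +/- epsilon cost nothing; larger errors grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Hypothetical fitted model (all values made up for illustration only).
support_vectors = [(0.2, 0.5), (0.8, 0.1)]
dual_coefs = [1.5, -0.7]
bias = 3.0
gamma = 0.5

# The estimate depends on the query point only through its kernel
# similarity to the support vectors, not on every training point.
est = svr_predict((0.2, 0.5), support_vectors, dual_coefs, bias, gamma)
print(round(est, 4))
```

Because the estimate is built from a small set of support vectors and small residuals are ignored by the loss, a model of this form tends to trade some peak accuracy for smoother behaviour, which is in line with the observation above.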

Nevertheless, the SVM appears to have a major pitfall. Figure 4.2(e) shows that the SVM tended to overestimate ET0 when fewer input meteorological variables were used. Besides, comparing Figure 4.2(e) with Figure 4.1(e) shows that the MBEs of the SVM’s estimations (ranging between -0.15 mm/day and 0.2 mm/day) were larger in magnitude than those of the MLP (ranging between -0.015 mm/day and 0.015 mm/day). This could be attributed to the relatively lower accuracy of the SVM, which consequently affects its MBE.
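For reference, the five performance metrics compared in this section can be computed as in the sketch below. The definitions are the conventional ones and are assumptions here: the MBE is taken as the mean of (estimated - observed), so that a positive value indicates overestimation, and R² is taken as the coefficient of determination. The sample values are hypothetical.

```python
import math

def error_metrics(observed, estimated):
    """Return MAE, RMSE, MAPE (%), MBE and R2 for paired ET0 series (mm/day)."""
    n = len(observed)
    errors = [e - o for o, e in zip(observed, estimated)]
    mae = sum(abs(d) for d in errors) / n
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    mape = 100.0 * sum(abs(d) / abs(o) for d, o in zip(errors, observed)) / n
    mbe = sum(errors) / n                      # positive means overestimation
    mean_obs = sum(observed) / n
    ss_res = sum(d * d for d in errors)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot                 # assumed definition of R2
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "MBE": mbe, "R2": r2}

# Hypothetical daily ET0 values, for illustration only.
observed = [4.0, 5.0, 3.0, 4.0]
estimated = [4.2, 4.9, 3.1, 4.2]
metrics = error_metrics(observed, estimated)
print({k: round(v, 4) for k, v in metrics.items()})
```

Note that the MAE and RMSE can be identical for two models while their MBEs differ, since positive and negative errors cancel in the MBE but not in the other two metrics.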


Figure 4.3. The ANFIS shares a similar structure with the MLP, in that both are composed of neural networks. What distinguishes the ANFIS from the MLP is that the neural network of the ANFIS is used to tune the fuzzy rules and the membership functions instead of the weights and biases. The use of fuzzy rules in the ANFIS allows the data to be described in terms of likelihood.
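To make the role of the membership functions and fuzzy rules concrete, the following is a minimal sketch of a forward pass through a first-order Sugeno-type fuzzy system with a single input. The Gaussian membership functions, the two rules and all parameter values are assumptions chosen for illustration; the configuration actually used in this study is not reproduced here.

```python
import math

def gaussian_mf(x, centre, sigma):
    """Degree (0 to 1) to which x belongs to a fuzzy set."""
    return math.exp(-0.5 * ((x - centre) / sigma) ** 2)

def anfis_forward(x, rules):
    """Each rule is (centre, sigma, p, q) with linear consequent p * x + q."""
    strengths = [gaussian_mf(x, c, s) for c, s, _, _ in rules]   # firing strengths
    total = sum(strengths)
    weights = [w / total for w in strengths]                     # normalisation
    # Output is the weighted sum of the rule consequents.
    return sum(w * (p * x + q) for w, (_, _, p, q) in zip(weights, rules))

# Two hypothetical rules; training would tune all four numbers in each tuple.
rules = [
    (0.0, 1.0, 1.0, 0.0),   # "input is low"  -> output = 1.0 * x + 0.0
    (4.0, 1.0, 0.5, 2.0),   # "input is high" -> output = 0.5 * x + 2.0
]

print(anfis_forward(2.0, rules))   # both rules fire equally at the midpoint
```

In this picture, training adjusts the membership parameters (`centre`, `sigma`) and the consequent parameters (`p`, `q`) rather than a set of connection weights and biases.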

As with the SVM, the MAE and RMSE of the ET0 estimation by the ANFIS at Station 48603 (Alor Setar) using C1 as the input combination were 0.0438 mm/day and 0.0586 mm/day, respectively, which were higher than those of the MLP. This corresponds to a MAPE of 1.0241 %. However, when C44 was used as the input combination, the MAPE increased by only 2.1995 % (2.5835 % for the MLP), with the MAE and RMSE standing at 0.1379 mm/day and 0.1834 mm/day, respectively. Yet again, the MLP was outperformed by another model when the input meteorological variables were reduced.

It is interesting to note that the ANFIS can successfully overcome the problem of bias error. The MBE of the ANFIS-estimated ET0 at all stations mostly lay between -0.002 mm/day and 0.002 mm/day, with very few exceptions. Furthermore, the MBE of the ANFIS’s estimations appeared to be unaffected by the input combinations. This observation supports the versatility of the ANFIS across various input combinations, which underpins its robustness and generalisability.

4.1.3 Selection of Input Combinations

As illustrated in Figure 4.1, Figure 4.2 and Figure 4.3, 63 input combinations were tested at each station for the estimation of ET0. The use of different input combinations caused the output accuracy to vary, even though the same machine learning model was employed. For screening purposes, this study utilised the BMS algorithm to select the best input combinations in different scenarios.

Specifically, C1 will be selected for the scenario when all the six meteorological variables are available for making ET0 estimation. When only five meteorological variables are available, the best input combination will be selected from C2 to C7 and so on. The selection was made based on the posterior model probability depicted in Equation (3.8). The best input combinations for each scenario at different stations are summarised in Table 4.1.
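The scenario ranges mentioned above follow from simple counting: six variables give 2^6 - 1 = 63 non-empty input combinations. The sketch below assumes that the combinations are numbered in order of decreasing number of variables, which is consistent with C1 (six variables) and C2 to C7 (five variables); the ordering within each group is not assumed, and the variable order in the list is arbitrary.

```python
from itertools import combinations
from math import comb

# The six meteorological variables named in this section.
VARIABLES = ["Tmax", "Tmean", "Tmin", "RH", "u", "Rs"]

def index_ranges(n_vars):
    """Map number of variables -> (first, last) combination index.

    Assumes combinations are numbered from the largest to the smallest set.
    """
    ranges = {}
    start = 1
    for size in range(n_vars, 0, -1):
        count = comb(n_vars, size)
        ranges[size] = (start, start + count - 1)
        start += count
    return ranges

ranges = index_ranges(len(VARIABLES))
total = sum(len(list(combinations(VARIABLES, k)))
            for k in range(1, len(VARIABLES) + 1))

print(total)    # 63
print(ranges)   # {6: (1, 1), 5: (2, 7), 4: (8, 22), 3: (23, 42), 2: (43, 57), 1: (58, 63)}
```

Under this numbering, the combinations selected in Table 4.1 fall into the expected ranges, for example C13 and C17 among the four-variable combinations and C58 among the single-variable ones.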

According to Table 4.1, all of the studied stations agreed that C1, C4 and C58 were the best input combinations when six, five and one meteorological variables were fed into the models, respectively. The C4 combination removed the Tmean from the input dataset, whereas C58 used the key Rs as the sole meteorological variable. When only four meteorological variables were fed into the MLP, SVM and ANFIS, C13 was selected as the best input combination except for Station 48623 (Lubok Merbau) and Station


from C4, whereas the C17 combination excluded the RH from C4.

On the other hand, for the MLP, SVM and ANFIS trained with only three meteorological variables, 83.33 % of the studied stations favoured C33, which was a result of the exclusion of RH from C13. This echoed the phenomenon observed when four meteorological variables were used to estimate ET0, where the Tmin and RH were selectively removed from the input combination. However, Station 48600 (Pulau Langkawi) and Station 48601 (Bayan Lepas) preferred to use C23 as the input combination, whereby all the temperature variables were omitted.

As for the input combinations that consist of only two meteorological variables, the discrepancy among the stations widened. Three different input combinations were selected at this stage, namely C43 (41.67 %), C44 (50 %) and C53 (8.33 %). Station 48632 (Cameron Highlands), located at a higher altitude, stood out amongst the other stations and used C53 for the two-variable estimation of ET0. The C53 combination consists of Tmax and Rs, which could be essential for the ET0 process in the highlands due to the lower average temperature throughout the year. The stations located in the northern region and on the east coast of Peninsular Malaysia, namely Station 48600 (Pulau Langkawi), Station 48601 (Bayan Lepas), Station 48603 (Alor Setar), Station 48615 (Kota Bharu), Station 48649 (Muadzam Shah) and Station 48657 (Kuantan), preferred to use C44. The C44 combination includes RH and Rs; the RH is another important driving force of the ET process, and it also appeared in C1, C4 and C13. The other stations, which are not located in the northern region or on the east coast of Peninsular Malaysia, including Station 48620 (Sitiawan), Station 48623 (Lubok Merbau), Station 48625 (Ipoh), Station 48647 (Subang) and Station 48650 (KLIA), used C43, which is composed of Rs and u. It was explained in the previous sub-section that the Rs, being the key meteorological variable, could encompass the effect of many other meteorological variables except for u. Hence, for the two-variable estimation, it is reasonable to complement the Rs with u.

Table 4.1: Best Input Combination using Different Number of Input Meteorological Variables at Different Stations

Stations                             Number of Meteorological Variables
                                     6    5    4    3    2    1
Station 48600 (Pulau Langkawi)       C1   C4   C13  C23  C44  C58
Station 48601 (Bayan Lepas)          C1   C4   C13  C23  C44  C58
Station 48603 (Alor Setar)           C1   C4   C13  C33  C44  C58
Station 48615 (Kota Bharu)           C1   C4   C13  C33  C44  C58
Station 48620 (Sitiawan)             C1   C4   C13  C33  C43  C58
Station 48623 (Lubok Merbau)         C1   C4   C17  C33  C43  C58
Station 48625 (Ipoh)                 C1   C4   C13  C33  C43  C58
Station 48632 (Cameron Highlands)    C1   C4   C13  C33  C53  C58
Station 48647 (Subang)               C1   C4   C13  C33  C43  C58
Station 48649 (Muadzam Shah)         C1   C4   C17  C33  C43  C58
Station 48650 (KLIA)                 C1   C4   C13  C33  C44  C58
Station 48657 (Kuantan)              C1   C4   C13  C33  C44  C58

By observing the sets of best input combinations, the 12 studied stations can be grouped into five different clusters. Stations that preferred the same set of input combinations were grouped into the same cluster. For example, both Station 48600 (Pulau Langkawi) and Station 48601 (Bayan Lepas) had the best input combinations of C1, C4, C13, C23, C44 and C58. Therefore, the two stations were classified into the same cluster. The


merely based on the observation of the preferred sets of input combinations of the 12 stations. This allows a closer look at the common characteristics of the stations and helps to deduce the relationships among them. Table 4.2 shows the studied stations arranged in their respective clusters.

Table 4.2: Clustering of Stations Based on Best Input Combinations

Cluster      Station                              Remark
Cluster 1    Station 48603 (Alor Setar)           Meteorological Stations in Airport
             Station 48615 (Kota Bharu)
             Station 48650 (KLIA)
             Station 48657 (Kuantan)
Cluster 2    Station 48620 (Sitiawan)             Urban Area
             Station 48625 (Ipoh)
             Station 48647 (Subang)
Cluster 3    Station 48600 (Pulau Langkawi)       Island
             Station 48601 (Bayan Lepas)
Cluster 4    Station 48623 (Lubok Merbau)         Thick Vegetation
             Station 48649 (Muadzam Shah)
Cluster 5    Station 48632 (Cameron Highlands)    Highland
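The grouping rule described above (stations with an identical set of best input combinations share a cluster) can be reproduced with a short sketch. The station codes and combinations are transcribed from Table 4.1; the variable names are illustrative, and the clusters are produced in order of first appearance rather than with the cluster numbers of Table 4.2.

```python
from collections import defaultdict

# Best input combination per station for 6, 5, 4, 3, 2 and 1 variables (Table 4.1).
best = {
    "48600": ("C1", "C4", "C13", "C23", "C44", "C58"),
    "48601": ("C1", "C4", "C13", "C23", "C44", "C58"),
    "48603": ("C1", "C4", "C13", "C33", "C44", "C58"),
    "48615": ("C1", "C4", "C13", "C33", "C44", "C58"),
    "48620": ("C1", "C4", "C13", "C33", "C43", "C58"),
    "48623": ("C1", "C4", "C17", "C33", "C43", "C58"),
    "48625": ("C1", "C4", "C13", "C33", "C43", "C58"),
    "48632": ("C1", "C4", "C13", "C33", "C53", "C58"),
    "48647": ("C1", "C4", "C13", "C33", "C43", "C58"),
    "48649": ("C1", "C4", "C17", "C33", "C43", "C58"),
    "48650": ("C1", "C4", "C13", "C33", "C44", "C58"),
    "48657": ("C1", "C4", "C13", "C33", "C44", "C58"),
}

# Stations sharing exactly the same tuple end up in the same cluster.
clusters = defaultdict(list)
for station, combos in best.items():
    clusters[combos].append(station)

for combos, stations in clusters.items():
    print(stations, combos)
```

Grouping on exact equality is the simplest possible rule and treats a single disagreement the same as several, which is why the clusters overlap heavily.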

Although the stations can be grouped into clusters, it is important to note that there are large intersections between the clusters. For example, according to Table 4.1, Cluster 1 and Cluster 2 only differed when ET0 was estimated using two meteorological variables, whereas Cluster 2 and Cluster 3 have only two disagreements in selecting the best input combinations. Clustering provides a better understanding of the relationship between the stations but should not be taken as the ultimate guideline.

Undeniably, applying the BMS for input combination screening has successfully eliminated 57 less promising input combinations at each station. This effort would improve the computational efficiency, as researchers and the authorities need not look at all the meteorological variables when estimating ET0 in the future. Although there was partial disagreement among the studied stations when two, three and four meteorological variables were used for training, one can still generalise the preferred input combination based on the number of votes each combination receives.
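The vote-based generalisation mentioned above can be sketched as a simple tally for the three scenarios in which the stations disagreed. The selections are transcribed from Table 4.1 in station order; the majority rule itself is a plain illustration and not a procedure specified in this study.

```python
from collections import Counter

# Selected combination at each of the 12 stations (Table 4.1 order),
# keyed by the number of meteorological variables used.
picks = {
    4: ["C13", "C13", "C13", "C13", "C13", "C17",
        "C13", "C13", "C13", "C17", "C13", "C13"],
    3: ["C23", "C23", "C33", "C33", "C33", "C33",
        "C33", "C33", "C33", "C33", "C33", "C33"],
    2: ["C44", "C44", "C44", "C44", "C43", "C43",
        "C43", "C53", "C43", "C43", "C44", "C44"],
}

winners = {}
shares = {}
for n_vars, selected in picks.items():
    votes = Counter(selected)
    winners[n_vars] = votes.most_common(1)[0][0]          # most-voted combination
    shares[n_vars] = {c: round(100.0 * v / len(selected), 2)
                      for c, v in votes.items()}          # share of stations (%)

print(winners)  # {4: 'C13', 3: 'C33', 2: 'C44'}
print(shares)
```

The shares for the three-variable and two-variable scenarios reproduce the percentages quoted earlier in this sub-section (83.33 % for C33; 50 %, 41.67 % and 8.33 % for C44, C43 and C53).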

At this point, the modelling work (training and testing) involving the base MLP, SVM and ANFIS has been completed. The following sections discuss the hybridisation of the base models to improve the model performance from different aspects. The subsequent modelling work does not cover all 63 possible input combinations, but focuses only on the best input combinations obtained from the BMS screening (as shown in Table 4.1), as the unselected input combinations would produce less accurate estimations.

From the document ROBUST DATA FUSION TECHNIQUES INTEGRATED MACHINE LEARNING MODELS FOR ESTIMATING (pages 129-143)