

V. Machine Learning System for Spent Fuel Analysis

5.8. Uncertainty Quantification of Machine Learning Models

5.8.2. Uncertainty Quantification due to Training Data Selection and Size


shows high values at some training points because the 100 models developed from random bootstrap samples have large variance at those points. This variance arises from the statistical fluctuations inherent in random sampling, which require a large number of samples to converge. For the test set, the GPR and SVM CIs are larger than those of the training set because these models show large variance on the test set. Figure 5.15 shows the NN CI on the training and test sets. A large CI on the test set is observed. The NN may perform poorly on a small dataset and, most importantly, is more prone to the statistical fluctuations that occur during the random division of data into training and test sets, the random selection of the weights and biases, the optimization of hyperparameters during NN training, and the random shuffling of the training set before each epoch. Thus, the effect of these fluctuations and the error due to the limited size of the training set are relatively larger for the NN than for models that are well suited to small datasets and less susceptible to statistical fluctuations, such as GPR and SVM.
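The bootstrap procedure described above can be sketched in a few lines. This is a minimal illustration on a toy 1-D dataset, with scikit-learn's SVR standing in for the SVM model; the data, kernel settings, and ensemble size are assumptions for demonstration, not the PWR setup used in this work:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

def bootstrap_ci(X_train, y_train, X_eval, n_boot=100, z=1.96):
    """Train one SVR per bootstrap resample of the training data and
    return the ensemble mean with a normal-approximation 95% CI."""
    preds = np.empty((n_boot, len(X_eval)))
    for b in range(n_boot):
        idx = rng.integers(0, len(X_train), len(X_train))  # sample with replacement
        preds[b] = SVR().fit(X_train[idx], y_train[idx]).predict(X_eval)
    mean = preds.mean(axis=0)
    std = preds.std(axis=0, ddof=1)  # spread across the 100 bootstrap models
    return mean, mean - z * std, mean + z * std

# Toy data (hypothetical, not the PWR measurements)
X = rng.uniform(0.0, 10.0, size=(60, 1))
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.1, size=60)
mean, lo, hi = bootstrap_ci(X, y, X)
```

Points where the bootstrap models disagree (large `std`) receive a wide band, which is exactly the behavior discussed for the training points above.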

The error due to the limited size of the dataset can be decreased by acquiring a larger dataset, which is possible with synthetic data. This would also eliminate or reduce the effect of the statistical fluctuations. Table 5.18 summarizes this, showing the maximum standard deviation of NN predictions on the original and synthetic PWR datasets. Moreover, as Figure 5.16 shows, the CI on the original training and test sets is reduced when the NN model is built with the larger synthetic dataset, compared to Figure 5.15. Table 5.18 and Figure 5.16 present the results of applying the bootstrap method to the synthetic dataset and then using the models developed from the bootstrap datasets to predict the outputs of the original training and test sets. The benefit of the large synthetic dataset can be explained, for the NN model, as follows: the NN approximates a function with many unknowns. With a small dataset, the approximating function exhibits fluctuations due to the paucity of input and output data. These fluctuations are reduced or eliminated when the dataset is large, because the approximating function becomes smooth.

Table 5.18. Maximum standard deviation of NN predictions (W).

Dataset size        Training set    Test set
91 (original)       349.05          120.55
910 (synthetic)     56.96           86.75
9100 (synthetic)    12.17           10.01
18200 (synthetic)   10.36           4.92
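The trend in Table 5.18, where the ensemble spread shrinks as the dataset grows, can be reproduced in miniature. In this sketch a simple ridge regression on toy data stands in for the NN, so the numbers are illustrative only; the dataset sizes merely mimic those in the table:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)

def max_pred_std(n_samples, n_boot=50):
    """Maximum bootstrap-ensemble standard deviation of predictions
    for a toy dataset of the given size."""
    X = rng.uniform(0.0, 10.0, size=(n_samples, 1))
    y = 2.0 * X[:, 0] + rng.normal(0.0, 1.0, size=n_samples)
    X_eval = np.linspace(0.0, 10.0, 20).reshape(-1, 1)
    preds = np.empty((n_boot, 20))
    for b in range(n_boot):
        idx = rng.integers(0, n_samples, n_samples)  # bootstrap resample
        preds[b] = Ridge().fit(X[idx], y[idx]).predict(X_eval)
    return preds.std(axis=0, ddof=1).max()

# Larger dataset -> smaller maximum ensemble spread
spread_small, spread_large = max_pred_std(91), max_pred_std(9100)
```

Because the sampling error of the fitted coefficients scales roughly as 1/sqrt(n), the maximum prediction spread drops by about an order of magnitude here, mirroring the direction of the trend in Table 5.18.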


Figure 5.13. SVM predictions and uncertainties due to training data selection and size (original PWR dataset).

Figure 5.14. GPR predictions and uncertainties due to training data selection and size (original PWR dataset).


Figure 5.15. NN predictions and uncertainties due to training data selection and size (original PWR dataset).

Figure 5.16. NN predictions and uncertainties when the synthetic PWR dataset model is applied to the original dataset.


The ability to provide the point estimates of the ML models with associated uncertainties is important for establishing accuracy and reliability. Furthermore, quantifying the uncertainties in the point estimates will be important if the models are used for extrapolation, i.e., when the input features are outside the range shown in Table 5.3. Quantifying uncertainties from all known sources and then combining this information with measurement data (if available), measurement uncertainties (if known), and the model predictions and bias is a necessary step in obtaining BEPU values. Despite the large uncertainties in the small-dataset predictions, the differences between the predicted and measured decay heat are not statistically significant; they are within the statistical and measurement uncertainties. In addition, the model bias is not included in the CI calculations because the square of the bias is much smaller than the variance of the predictions.
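The bias criterion mentioned above can be written as a simple rule: add the squared bias to the ensemble variance only when it is not negligible relative to that variance. A minimal sketch, in which the 10% threshold and the toy ensemble are assumptions for illustration:

```python
import numpy as np

def ci_with_bias_check(preds, measured, z=1.96, rel_tol=0.1):
    """95% CI from a bootstrap prediction matrix (n_models, n_points).
    The squared model bias is folded into the variance only when it is
    non-negligible relative to the mean ensemble variance."""
    mean = preds.mean(axis=0)
    var = preds.var(axis=0, ddof=1)
    bias = float(np.mean(mean - measured))
    if bias**2 >= rel_tol * var.mean():  # otherwise bias is neglected
        var = var + bias**2
    half = z * np.sqrt(var)
    return mean - half, mean + half

# Hypothetical ensemble of 100 models evaluated at 5 points (W)
rng = np.random.default_rng(2)
preds = rng.normal(1000.0, 50.0, size=(100, 5))
measured = np.full(5, 1000.0)
lo, hi = ci_with_bias_check(preds, measured)
```

When the bias term is negligible, as reported for the models in this section, the interval reduces to the plain bootstrap CI.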

Two sensitivity tests were further performed to examine whether synthetic data generated from a much smaller dataset can still improve model accuracy and reduce the inherent statistical fluctuations. For these tests, the PWR data was split into (i) a 40% training and 60% test set and (ii) a 20% training and 80% test set. Because the two tests led to similar conclusions, only the results for the 20% training and 80% test split are shown in Figures 5.17 and 5.18, together with the reference measured data. The average and gray band in those figures are obtained from the bootstrap method. In Figure 5.18, the synthetic data is generated from the 20% training set.

This synthetic data is then used to create bootstrap samples, models are developed from the bootstrap datasets, and these models predict the outputs of the original 20% training and 80% test sets. The observations in Figures 5.17 and 5.18 are similar to those in Figures 5.15 and 5.16, in that the synthetic data seems to reduce the uncertainties due to training data selection and size. However, because the training set in Figures 5.17 and 5.18 is smaller, some statistical fluctuations can still be noticed. From the results of the 40% training and 60% test split (not shown here), we infer that the training set should not contain fewer than 40 samples. For smaller training sets, the synthetic dataset would have to be much larger, and a NN would have to be employed for training in that case. A robust regression model is one whose predictions are randomly distributed around the reference solution, as shown in the figures presented in this section. From the sensitivity tests, we further infer that the GPR is well suited to much smaller datasets and should be used in place of the NN.
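The split-sensitivity test can be sketched as follows. Toy data is used here; scikit-learn's default-kernel GPR with a small jitter stands in for the report's GPR model, and the 91-sample size only mimics the original PWR set, so all of these choices are assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.uniform(0.0, 10.0, size=(91, 1))  # 91 samples, like the original set
y = np.sin(X[:, 0]) + rng.normal(0.0, 0.05, size=91)

def max_test_std(train_frac, n_boot=30):
    """Max bootstrap-ensemble std of GPR predictions on the held-out set
    for a given training fraction."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, random_state=0)
    preds = np.empty((n_boot, len(X_te)))
    for b in range(n_boot):
        idx = rng.integers(0, len(X_tr), len(X_tr))
        gpr = GaussianProcessRegressor(alpha=1e-6)  # jitter for duplicated rows
        preds[b] = gpr.fit(X_tr[idx], y_tr[idx]).predict(X_te)
    return preds.std(axis=0, ddof=1).max()

spread_40 = max_test_std(0.4)  # 40% training / 60% test
spread_20 = max_test_std(0.2)  # 20% training / 80% test
```

Comparing the two spreads for a given model and dataset indicates how sensitive the uncertainty band is to the training fraction, which is the quantity the two sensitivity tests probe.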


Figure 5.17. Uncertainties due to training data selection and size (original PWR dataset).

Figure 5.18. Uncertainties when synthetic PWR dataset model is applied to the original dataset.
