State-of-the-Art Deterministic ANN Methodology
3.4 Further Investigation of Deterministic ANN Development Methods Using Synthetic Data
3.4.1 Synthetic Data Sets
Three different synthetic data sets (data sets I, II and III) were used to compare and assess the methods investigated in the following sections. The use of synthetically generated data enabled the methods to be properly assessed and compared without the complication of other uncertainties, such as unknown important inputs or an unknown error model, and without being affected by data limitations. More importantly, if real data were used to assess the input importance measures investigated, as discussed in Section 3.4.4, it would not be possible to check the accuracy of the measures in relation to the true contributions.
Furthermore, it was possible to generate both “true” and “measured” target data, where the “true” data were generated by the model function without the addition of a random noise component and the “measured” data were obtained by corrupting the “true” data with random “measurement” errors, $\varepsilon \sim N(0, 1)$.
The data sets were generated with different degrees of nonlinearity, noise levels and sizes, to represent the variety of cases that could be encountered in a real forecasting situation. Additionally, each data set represents a different type of forecasting problem: time series forecasting, where inputs are past observations of the data series (i.e. $y_t = f(y_{t-1}, \ldots, y_{t-K})$); causal forecasting, where inputs are independent predictor variables (i.e. $y = f(x_1, \ldots, x_K)$); and an in-between case, where inputs include both past values of the data series and independent variables (i.e. $y_t = f(y_{t-1}, \ldots, y_{t-P}, x_1, \ldots, x_{K-P})$). To quantify the degree of nonlinearity in the data sets, a multivariate linear regression model was fitted to the noise-free, or “true”, data. The fit of the linear model to the data was evaluated using the coefficient of determination $r^2$, given by (3.5), where the closer $r^2$ is to one, the more linear the data are, and vice versa. To evaluate the noise levels in the data, the signal-to-noise ratio was evaluated as follows:
$$\lambda = \frac{\sigma^2_{\text{signal}}}{\sigma^2_{\text{noise}}} \tag{3.24}$$
where $\sigma^2_{\text{signal}}$ is the variance of the “true” data and $\sigma^2_{\text{noise}}$ is the variance of the added noise component.
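The two diagnostics described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: it assumes that (3.5) is the usual coefficient of determination of a least-squares fit with an intercept, and the function names `linear_r2` and `signal_to_noise` are hypothetical.

```python
import numpy as np

def linear_r2(X, y_true):
    """r^2 of an ordinary least-squares fit (with intercept) to the noise-free data.

    Assumption: (3.5) is the standard coefficient of determination,
    1 - SS_res / SS_tot.
    """
    X = np.asarray(X, dtype=float).reshape(len(y_true), -1)
    A = np.column_stack([np.ones(len(y_true)), X])      # add intercept column
    coef, *_ = np.linalg.lstsq(A, y_true, rcond=None)   # least-squares coefficients
    residuals = y_true - A @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def signal_to_noise(y_true, noise):
    """Signal-to-noise ratio of (3.24): var("true" data) / var(added noise)."""
    return np.var(y_true) / np.var(noise)
```

A value of `linear_r2` close to one indicates data that a linear model reproduces almost exactly, while `signal_to_noise` grows as the added noise becomes small relative to the spread of the “true” data.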
The generated data were divided into training, testing and validation data subsets using the SOM data division method discussed in Section 3.2.2. In order to find the optimal grid size for each data set, the average silhouette widths and discrepancy values (see Section 3.2.2.2) were compared for SOM grid sizes ranging from 1×2 to 12×12. Once clustered, 64% of the data samples were allocated to the training subset, 16% were allocated to the testing subset and the remaining 20% were allocated to the validation subset, ensuring that at least one sample from each cluster was allocated to each subset where possible.
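The allocation step can be illustrated with the following sketch. It assumes that cluster labels are already available (the SOM clustering itself is not reproduced here) and that samples are drawn at random within each cluster; the exact sampling rule of Section 3.2.2 may differ. The function name `split_by_cluster` and the handling of clusters with fewer than three samples are assumptions made for illustration.

```python
import numpy as np

def split_by_cluster(labels, fractions=(0.64, 0.16, 0.20), seed=0):
    """Allocate sample indices to training/testing/validation, cluster by cluster.

    Testing and validation counts follow fractions[1] and fractions[2];
    the training subset receives the remainder of each cluster.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    parts = {"train": [], "test": [], "valid": []}
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))  # shuffle cluster members
        n = len(idx)
        if n >= 3:
            # at least one sample per subset from every cluster
            n_test = max(1, round(fractions[1] * n))
            n_valid = max(1, round(fractions[2] * n))
        else:
            # assumption: clusters too small to cover all subsets go to training
            n_test = n_valid = 0
        n_train = n - n_test - n_valid
        parts["train"].extend(idx[:n_train].tolist())
        parts["test"].extend(idx[n_train:n_train + n_test].tolist())
        parts["valid"].extend(idx[n_train + n_test:].tolist())
    return {name: np.array(ix, dtype=int) for name, ix in parts.items()}

# Hypothetical example: 870 samples spread evenly over 6 clusters
example_labels = np.repeat(np.arange(6), 145)
example_parts = split_by_cluster(example_labels)
```

Because rounding is applied within each cluster, the subset sizes produced by this sketch can differ by a few samples from an exact 64/16/20 split of the whole data set.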
3.4.1.1 Data Set I
The autoregressive model of order nine (i.e. AR(9)), given by (3.25), was used to generate 870 data points to make up data set I. This model, also used by Sharma (2000) and Bowden et al. (2005a) for the generation of synthetic data, is a linear time series model, including only past observations of the data as inputs.
$$y_t = 0.3y_{t-1} - 0.6y_{t-4} - 0.5y_{t-9} + \varepsilon_t \tag{3.25}$$
An $r^2$ value of 0.999 was obtained by fitting a linear regression model to the “true” data, indicating that the generated data are linear, as expected. The signal-to-noise ratio of the “measured” data set was $\lambda = 1.92$, which indicates that the data are fairly noisy (i.e. the strength of the signal is less than twice the strength of the noise).
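A series of this kind could be generated along the following lines. The sketch rests on several assumptions that are not stated in the text: the “true” value at time t is taken to be the deterministic part of (3.25) evaluated on the previous “measured” values, the series is started from zeros, and an initial burn-in period is discarded. The function name `generate_ar9` and the burn-in length are hypothetical.

```python
import numpy as np

def generate_ar9(n=870, burn_in=200, seed=0):
    """Generate "measured", "true" and noise series for the AR(9) model (3.25).

    Assumptions: "true" values are the noise-free right-hand side of (3.25)
    computed from earlier "measured" values; zero initial conditions;
    the first samples are discarded as burn-in.
    """
    rng = np.random.default_rng(seed)
    total = n + burn_in + 9
    eps = rng.normal(0.0, 1.0, size=total)   # "measurement" errors ~ N(0, 1)
    y = np.zeros(total)                      # "measured" series
    y_true = np.zeros(total)                 # noise-free component
    for t in range(9, total):
        y_true[t] = 0.3 * y[t - 1] - 0.6 * y[t - 4] - 0.5 * y[t - 9]
        y[t] = y_true[t] + eps[t]
    keep = slice(total - n, total)           # drop initial values and burn-in
    return y[keep], y_true[keep], eps[keep]

y_meas, y_noise_free, noise = generate_ar9()
snr = np.var(y_noise_free) / np.var(noise)   # sample estimate of lambda in (3.24)
```

Under these assumptions, `snr` gives a sample estimate of the ratio in (3.24) that can be compared with the value reported above; the estimate varies with the random seed.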
A histogram showing the probability density of the data is given in Figure 3.9, where it can be seen that the data are approximately normally distributed, indicating that only linear rescaling (standardisation) of the data was necessary. It was found that a 1×6 SOM grid size was optimal for clustering this data set, which resulted in 6 clusters, all containing more than 3 samples. From these clusters, 557 samples were allocated to the training data subset, 139 were allocated to the testing subset and 174 were allocated to the validation subset.

Figure 3.9 Probability density of response variable $y_t$ for data set I (horizontal axis: $y_t$; vertical axis: probability density).
3.4.1.2 Data Set II
Data set II, consisting of 1120 data points, was also generated by the AR(9) time series model, but with the linear addition of an independent nonlinear component $\sin(2x_t)$, as shown in (3.26), where $x_t$ is an independent random variable uniformly distributed between $-\pi$ and $\pi$.
$$y_t = 0.3y_{t-1} - 0.6y_{t-4} - 0.5y_{t-9} + \sin(2x_t) + \varepsilon_t \tag{3.26}$$
An $r^2$ value of 0.864 was obtained for this data set when a linear regression model was fitted to the “true” data, indicating that the generated data are reasonably linear, with a nonlinear component that the linear model cannot account for. The signal-to-noise ratio of the “measured” data set was $\lambda = 3.39$, which indicates that the noise levels in the data are moderate (i.e. the strength of the signal is greater than three times the strength of the noise).
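As a rough check on why a linear model cannot account for the sine component, the following worked calculation considers that component in isolation. It assumes only the stated distribution of $x_t$ and looks at the best straight-line fit of $\sin(2x_t)$ on $x_t$ alone; it is an illustration and does not reproduce the reported $r^2$ value, which also depends on the autoregressive terms.

```latex
% x_t ~ U(-\pi, \pi); moments of the sine component
\begin{align*}
E[\sin(2x_t)] &= \frac{1}{2\pi}\int_{-\pi}^{\pi} \sin(2x)\,dx = 0, \\
\operatorname{Var}[\sin(2x_t)] &= \frac{1}{2\pi}\int_{-\pi}^{\pi} \sin^2(2x)\,dx = \frac{1}{2}, \\
\operatorname{Cov}[x_t, \sin(2x_t)] &= \frac{1}{2\pi}\int_{-\pi}^{\pi} x\sin(2x)\,dx = -\frac{1}{2},
\qquad \operatorname{Var}[x_t] = \frac{\pi^2}{3}, \\
\rho^2 &= \frac{\operatorname{Cov}[x_t, \sin(2x_t)]^2}{\operatorname{Var}[x_t]\,\operatorname{Var}[\sin(2x_t)]}
        = \frac{1/4}{(\pi^2/3)(1/2)} = \frac{3}{2\pi^2} \approx 0.152.
\end{align*}
```

On this basis, a linear term in $x_t$ would explain only about 15% of the variance of the sine component, which is consistent with an overall $r^2$ noticeably below one.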
The probability density of the data is shown in Figure 3.10. Similar to data set I, the data are approximately normal; thus no further preprocessing was required apart from linear rescaling. For this data set, it was also found that a 1×6 SOM grid size was optimal for clustering the data and, again, each of the 6 clusters contained more than 3 data samples. From these clusters, 717 samples were allocated to the training subset, another 179 to the testing subset and the remaining 224 to the validation subset.
Figure 3.10 Probability density of response variable $y_t$ for data set II (horizontal axis: $y_t$; vertical axis: probability density).
3.4.1.3 Data Set III
Data set III consists of 1900 data points generated by (3.27). This function, suggested by Friedman (1991), describes a linear combination of both nonlinear and linear functions of independent random variables.
$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon \tag{3.27}$$
An $r^2$ value of 0.775 was obtained by fitting a linear regression model to the “true” data, indicating that data set III is the least linear of the data sets considered. The signal-to-noise ratio of the “measured” data set was $\lambda = 23.94$, which is significantly greater than the signal-to-noise ratios of data sets I and II. As the strength of the signal is almost 24 times the strength of the noise, the data are considered to contain little noise.
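The generating function (3.27) translates directly into code, as sketched below. The distribution of the inputs is not stated in this section; the sketch assumes independent inputs uniformly distributed on [0, 1], as in Friedman's original benchmark, and unit-variance Gaussian noise. The function names `friedman` and `generate_data_set_iii` are hypothetical.

```python
import numpy as np

def friedman(X):
    """Noise-free part of (3.27), evaluated row-wise on inputs x1..x5."""
    X = np.asarray(X, dtype=float)
    return (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
            + 20.0 * (X[:, 2] - 0.5) ** 2
            + 10.0 * X[:, 3]
            + 5.0 * X[:, 4])

def generate_data_set_iii(n=1900, seed=0):
    """Generate inputs, "true" and "measured" outputs for (3.27).

    Assumption: inputs are independent and uniform on [0, 1];
    noise is N(0, 1) as for the other data sets.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 5))
    y_true = friedman(X)
    noise = rng.normal(0.0, 1.0, size=n)
    return X, y_true, y_true + noise, noise

X3, y3_true, y3_meas, noise3 = generate_data_set_iii()
snr3 = np.var(y3_true) / np.var(noise3)   # sample estimate of lambda in (3.24)
```

With these assumed input and noise distributions, the variance of the noise-free output is roughly 24, so `snr3` should be of the same order as the signal-to-noise ratio reported above.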
The probability density of the response variable $y$ is shown in Figure 3.11. Again, the distribution of the data is approximately symmetric and did not require a nonlinear transformation. A 1×8 SOM grid size was found to be optimal for this data set, resulting in 8 clusters, all containing more than 3 data samples. From the clusters, 1215, 304 and 380 samples were allocated to the training, testing and validation subsets, respectively.
Figure 3.11 Probability density of response variable $y$ for data set III (horizontal axis: $y$; vertical axis: probability density).