The main focus of this research is Silvescu's FNN, because its activation function does not fit into the category of networks in which a linearly transformed input is passed through an activation. For the non-trivial Silvescu FNN, the convergence rate is proven to be of order $O(1/n)$. The paper then investigates the classes of functions approximated by the Silvescu FNN, which turn out to include functions from the Schwartz space and the space of positive definite functions.
In this section I mention some important people who contributed to this research during the preparatory stages: Tulenbayev from the Mathematics Department of Suleyman Demirel University, who agreed to be a member of the thesis committee and review the paper, and Malika Uteuliyeva, now a third-year undergraduate student at Nazarbayev University, who helped with the Python implementation of the Silvescu Fourier Neural Network.
Introduction
General model
Existing implementations
- Gallant and White FNN
- Silvescu FNN
- Tan, Zuo and Cai FNN
- Liu FNN
Instead of a nonlinear transformation of a weighted sum of inputs, here we take the product of a nonlinear filter applied to each individually weighted input. This makes Silvescu's implementation the most interesting, and the most challenging to analyze, among the FNNs listed. These two implementations resemble a Fourier series, except that the coefficients are not fixed, as they are in the Fourier case.
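To make the distinction concrete, here is a minimal NumPy sketch (an illustration only, not the implementation used in our experiments; all names and shapes are my own) contrasting a standard unit, which applies the activation to a weighted sum, with a Silvescu-style unit, which multiplies nonlinear filters of individually weighted inputs:

import numpy as np

def standard_cos_unit(x, w, b):
    # "Affine input transformation": activation applied to a weighted sum, cos(w . x + b).
    return np.cos(np.dot(w, x) + b)

def silvescu_unit(x, w, b):
    # Silvescu-style unit: product of cosines of individually weighted inputs,
    # prod_k cos(w_k * x_k + b_k).
    return np.prod(np.cos(w * x + b))

# Tiny usage example (assumed shapes: x, w, b are d-dimensional vectors).
d = 3
rng = np.random.default_rng(0)
x, w, b = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
print(standard_cos_unit(x, w, b), silvescu_unit(x, w, b))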
Discussion
That is, the order of convergence for the Silvescu FNN is $O(1/n)$, given a fixed dimension $d$ and a bound $C$ on the Fourier transform. The convergence rate for all the mentioned implementations of FNN is expected to be the usual $O(1/n)$. A key method for deriving the rate of convergence is the one used by Barron in his investigation of sigmoidal activation functions [2].
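For context, Barron's result for sigmoidal networks can be recalled in a slightly simplified form (my paraphrase of [2]; the precise norms and constants are in the original): if $f$ has a finite first Fourier moment $C_f = \int_{\mathbb{R}^d} |w|\,|\tilde F|(dw)$, then for every probability measure $\mu$ on a bounded set $B$ and every $n \ge 1$ there is a network $f_n$ with $n$ sigmoidal units such that

\[
  \int_{B} \bigl(f(x) - f_n(x)\bigr)^2 \,\mu(dx) \;\le\; \frac{(2C_f)^2}{n},
  \qquad
  f_n(x) \;=\; \sum_{j=1}^{n} c_j\,\sigma(a_j \cdot x + b_j) + c_0 .
\]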
Indeed, the way the Gallant and White activation was defined allows us to view it as a sigmoidal function, in which case we can directly apply Barron's results. In our paper, we use the main idea of Barron's error estimation and derive our own proof for the Silvescu FNN. The main difference between the two results lies in the assumptions on the moments of the Fourier transform of the function.
Such an assumption was also used by Jones [11] to show a convergence rate of order $O(1/n)$ for networks with the "affine input transformation" mentioned above. Although the functions can be approximated within the upper error bound shown, the error plateaus at a certain level. However, for a smaller number of neurons, the dependence of the error on the size of the hidden layer appears to confirm the $O(1/n)$ convergence rate experimentally.
Define the class of functions $\Gamma$ that can be represented in the form above for some $\tilde F(dw)$ with the corresponding constant finite. Then define $\Gamma_B$ as the class of functions for which the representation in Equation (2.1) holds for $x \in B$ for some $\tilde F(dw)$ of finite total measure.
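Equation (2.1) itself is not reproduced here; a standard form consistent with the surrounding discussion (an assumed reconstruction on my part, which may differ from the paper's exact normalization) is

\[
  f(x) \;=\; \int_{\mathbb{R}^d} e^{\,i\,w\cdot x}\,\tilde F(dw), \qquad x \in B \subset \mathbb{R}^d,
\]
with the classes defined by a bound on the magnitude of $\tilde F$:
\[
  \Gamma_C \;=\; \Bigl\{\, f \;:\; \int_{\mathbb{R}^d} |\tilde F|(dw) \,\le\, C \,\Bigr\},
  \qquad
  \Gamma_{C,B} \;=\; \Bigl\{\, f \;:\; \text{(2.1) holds on } B \text{ with } \int_{\mathbb{R}^d} |\tilde F|(dw) \,\le\, C \,\Bigr\}.
\]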
Proof
More importantly, the last line of Equation (2.4) represents $f(x)$ as an infinite convex combination of functions in the set. Functions in $G_{\cos}$ are in the convex hull of functions in $G_{\mathrm{Silv}}$, that is, $G_{\cos} \subset \operatorname{conv} G_{\mathrm{Silv}}$. For each function $f \in \Gamma_{C,B}$ and each function in $G_{\mathrm{Silv}}$, we consider the function
We will use Lemma 2 and Lemma 3 to show that $f$ is in the closure of the convex hull of $G_{\mathrm{Silv}}$. Then, by Lemma 1, there exists $f_n$ in the convex hull of $n$ points of $G_{\mathrm{Silv}}$ such that the squared approximation error is bounded by a constant divided by $n$.
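The inclusion $G_{\cos} \subset \operatorname{conv} G_{\mathrm{Silv}}$ rests on the fact that a cosine of a weighted sum can be written as a signed combination of products of univariate cosines. A sketch of the underlying trigonometric identity (my illustration of the idea, not necessarily the exact decomposition used in the proof):

% For d = 2, using \sin\theta = \cos(\theta - \pi/2):
\[
  \cos(\theta_1 + \theta_2)
  \;=\; \cos\theta_1\cos\theta_2 - \sin\theta_1\sin\theta_2
  \;=\; \cos\theta_1\cos\theta_2 - \cos\!\bigl(\theta_1 - \tfrac{\pi}{2}\bigr)\cos\!\bigl(\theta_2 - \tfrac{\pi}{2}\bigr).
\]
% In general, expanding \cos(\theta_1 + \dots + \theta_d) = \operatorname{Re}\prod_k e^{i\theta_k}
% gives 2^{d-1} signed terms, each a product of cosines of phase-shifted individual arguments
% \theta_k = w_k x_k + b_k, i.e. each an element of G_Silv up to sign.

This is also one way to see how an exponential-in-$d$ constant, like the one discussed in the conclusions, can enter the bound.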
Functions in $\Gamma_C$
Functions with absolutely integrable Fourier transform
By definition, it is very similar to positive definite kernels, which are important tools in statistical learning theory. We can observe that for positive definite functions, the $L^1$ norm of the Fourier transform is determined by the value of $f$ at the origin. That is, one does not even need to compute the transform to calculate the constant $C$.
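This observation can be made explicit (a short derivation added here for clarity, using the representation assumed above): if $f$ is positive definite, then by Bochner's theorem its Fourier measure $\tilde F$ is nonnegative, so

\[
  \int_{\mathbb{R}^d} |\tilde F|(dw)
  \;=\; \int_{\mathbb{R}^d} \tilde F(dw)
  \;=\; \int_{\mathbb{R}^d} e^{\,i\,w\cdot 0}\,\tilde F(dw)
  \;=\; f(0),
\]
so the constant $C$ can be read off as $f(0)$ without ever computing the Fourier transform.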
This is a space of rapidly decreasing functions on $\mathbb{R}^d$ (functions with derivatives that decrease faster than the inverse of any polynomial). The Schwartz space is widely used in Fourier analysis, and applications can be found in the field of partial differential equations.
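For completeness, the standard definition (added here for reference):

\[
  \mathcal{S}(\mathbb{R}^d)
  \;=\;
  \Bigl\{\, f \in C^\infty(\mathbb{R}^d)
  \;:\;
  \sup_{x \in \mathbb{R}^d} \bigl| x^{\alpha}\, \partial^{\beta} f(x) \bigr| < \infty
  \ \text{for all multi-indices } \alpha, \beta \,\Bigr\}.
\]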
Properties of functions in $\Gamma_C$
This can be viewed as the Fourier transform of $f(x)$ centered on the set of points $w \in \mathbb{R}^d$ lying on a line along a fixed direction.
Examples of functions in $\Gamma_C$
Note that sigmoidal functions of the form $f(x) = \sigma(a \cdot x + b)$ do not have an integrable Fourier transform (which makes this case difficult to handle in practice). Multivariate Gaussians, however, are in the Schwartz space: this can be shown by a change of variables (first a shift and then a linear transformation). We now want to find the constant $C$ for this function based on our observations above.
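As an illustration, for the standard Gaussian (a worked example I am adding; the paper's function may carry a different normalization), the positive-definiteness observation gives the constant immediately, since the Gaussian's Fourier measure is nonnegative:

\[
  f(x) = e^{-\|x\|^2/2}
  \quad\Longrightarrow\quad
  C \;=\; \int_{\mathbb{R}^d} |\tilde F|(dw) \;=\; f(0) \;=\; 1 .
\]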
Many other functions can be derived from the Gaussian using the properties we have obtained. We will use some of them in the next chapter to evaluate how well the Silvescu Fourier Neural Network discussed earlier approximates them. The reason is the large error margin we get when the dimension is set too large.
However, our goal is to evaluate whether the convergence is of order $O(1/n)$, where $n$ is the number of hidden neurons. It was also interesting to see whether the network with the Silvescu activation function could approximate these functions adequately.
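One simple way to check an empirical $O(1/n)$ rate (a sketch of the kind of diagnostic one could use; the neuron counts and MSE values below are placeholders, not results from our experiments) is to fit the slope of $\log(\mathrm{MSE})$ against $\log(n)$: a slope close to $-1$ indicates $O(1/n)$ decay.

import numpy as np

# Placeholder data (illustrative only): hidden-layer sizes and corresponding test MSEs.
n_neurons = np.array([1, 2, 4, 8, 16, 32])
mse = np.array([0.8, 0.42, 0.21, 0.11, 0.06, 0.032])

# Fit log(MSE) ~ slope * log(n) + intercept; a slope near -1 suggests O(1/n) decay.
slope, intercept = np.polyfit(np.log(n_neurons), np.log(mse), deg=1)
print(f"estimated empirical order: MSE ~ n^{slope:.2f}")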
Software
This is a Stochastic Gradient Descent (SGD) algorithm with momentum and an adaptive learning rate. The momentum term accelerates convergence by adding a fraction of the previous gradient to the current one. With the combination of momentum and an adaptive learning rate, the algorithm is less likely to slow down as the gradient approaches zero near the minimum.
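For reference, a minimal NumPy sketch of an Adam-style update implementing the idea described above (a generic illustration with the usual default hyperparameters, not the actual training code used in our experiments):

import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum: exponential moving average of past gradients.
    m = beta1 * m + (1 - beta1) * grad
    # Adaptive learning rate: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias-corrected estimates (t is the 1-based step counter).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # The effective step size adapts per parameter, so it does not vanish
    # just because the raw gradient becomes small near the minimum.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v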
The reason we use the minimum values of the MSE is that our main theorem proves the existence of a network with the indicated MSE.
Hardware
Theoretically, we have proven that there exist parameters of the FNN for which the error of the network is bounded by a constant divided by the number of hidden neurons. However, the existence of such parameters does not imply that the same bound holds for every trained model. This is the best-case scenario, and obtaining such a solution is not guaranteed in practice.
Nevertheless, the importance of this paper lies in obtaining error bounds for networks that fall outside the "weighted sum of activations of linearly transformed inputs" paradigm. The standard models were described by Hornik [10] and Cybenko [3], and Barron [2] gave lower (Kolmogorov n-width) and upper bounds for approximation by sigmoidal neural networks. The Silvescu FNN, on the other hand, has a non-trivial activation, which makes it more difficult to analyze.
The constant factor $4^{d-1}$ in the error bound also appears because of the way the activation is constructed. To balance this expression, the number of neurons $n$ must be exponential in the dimensionality $d$. The experimental results were good for small values of $n$ (the number of neurons); in the interval $16 < n < 1024$ the error leveled off, and for larger values the error lies above the specified error bound.
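To illustrate the balance (a back-of-the-envelope calculation added here, treating the other factors in the bound as fixed):

\[
  \frac{4^{d-1}}{n} \;=\; \text{const}
  \quad\Longrightarrow\quad
  n \;\propto\; 4^{d-1},
  \qquad\text{e.g. } d = 5 \ \Rightarrow\ n \text{ must be } 4^{4} = 256 \text{ times larger than for } d = 1 .
\]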
Although the theoretical results provide an upper bound on the error, there must also be a lower bound. According to the graphs (figures can be found in the appendix), the error decay is described as linear for $n \le 16$, i.e. of order $O(1/n)$, whereas on the right tail of the MSE curves it is best described as $O(1)$. However, this can possibly be explained by the large number of neurons and thus the large number of parameters.
This is also likely because the error stops decaying around $10^{-3}$, which is even smaller than the error of our generated functions ($0.03$). The small fluctuations in the right tail can then be explained by the Adam optimizer finding different local minima during the training runs. A first possible improvement of this work would be to extend the results to linear and polynomial functions.
More research in this direction could shed light on the analysis of approximation by linear spans of Silvescu's functions.

[11] Jones, L. K. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. The Annals of Statistics, 1992.