CHAPTER 4: ARTIFICIAL NEURAL NETWORKS AND TRAINING
4.2. Concept of Back Propagation (BP)
Using multiple hidden layers of neurons facilitates greater processing power and flexibility of the system [65]. However, too many hidden neurons normally result in over-specification of the system, making it incapable of generalization, while too few hidden neurons result in improper fitting of the input data, thereby reducing system robustness.
Architectural definition of MLP networks is very important, as a poor choice of layers and neurons for connection can prevent the network from solving the problem through inadequate adjustment of the weight parameters. Architectural optimization of the connection of hidden layers and neurons, so as to establish an ANN that effectively solves a given problem, remains an open task in many research areas [118].
For an MLP, the output of each output neuron can be written as a weighted sum of its inputs passed through the activation function:

$y_i = \varphi\!\left(\sum_{n=1}^{N} w_{in}\, x_n + w_{i0}\right), \qquad i = 1, \ldots, M$    (4.13)
Figure 4.3. A sketch of a function with local minima and local maxima [69].
The error also depends on the neuron weights, which must eventually be modified in the network to permit learning. A conventional algorithm for finding the set of weights that minimizes the error is the gradient descent algorithm, while backpropagation is applied to calculate the direction of steepest descent [122].
4.2.1. Gradient Descent
Neuron weights permit ANNs to learn by being updated after forward passes of data through the network. The purpose of weight adjustment is to reconcile the difference between the desired value and the actual value resulting from a forward pass [65]. The error is an important measure of this difference, and the respective error of each neuron is sent backward through the network to aid the update process, i.e. backpropagation of error.
A cost function relating the error to the neuron weights, as shown in figure 4.4, can be applied, and the lowest point on the cost function, known as the optimal value, i.e. the local minimum where the rate of change of the function equals zero, can be ascertained. Conceptually, the slope of the cost function at the present location reveals the direction in which to move [123].
Algebraically, a negative slope indicates a downward movement, while a positive slope indicates an overshoot, i.e. movement beyond the optimum. The slope is determined using gradient descent.
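The idea can be illustrated with a minimal one-dimensional sketch (the cost function, learning rate and function names here are hypothetical, not from the text): starting from an arbitrary weight, each step moves against the sign of the slope until the rate of change is close to zero.

```python
# Minimal 1-D illustration of gradient descent (hypothetical example).
# Cost function C(w) = (w - 3)^2 has its optimum (minimum) at w = 3.
def gradient_descent(dC, w0, lr=0.1, steps=100):
    """Repeatedly step against the slope: w <- w - lr * C'(w)."""
    w = w0
    for _ in range(steps):
        w -= lr * dC(w)
    return w

dC = lambda w: 2.0 * (w - 3.0)      # derivative of (w - 3)^2
w_opt = gradient_descent(dC, w0=0.0)
# A negative slope (w < 3) pushes w upward; a positive slope (w > 3)
# pulls w back down, so w_opt settles near the optimum w = 3.
```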
Gradient is the rate at which a function changes, while descent implies exploring towards the base of the cost function as the gradient changes. The method takes a complete forward pass of the data set, calculates the cost, and thereafter propagates the errors backward to the neurons through the network [78]. There are two types of gradient descent: (i) vanilla plain and (ii) stochastic gradient descent [124]. All data weights are repeatedly adjusted when applying Vanilla Plain Gradient Descent (VGD), while Stochastic Gradient Descent (SGD) samples the data randomly. Learning can be sped up by random sampling of the data, for an improved prediction result. Gradient descent is vulnerable to local minima when every data instance is used to determine the weight adjustment; considering the data en bloc makes it less vulnerable to extremes and outliers, but this is undesirable when searching for the global minimum.
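The two variants can be sketched as follows (the data set, learning rate and function names are hypothetical, not taken from the text), fitting a single weight w in y = w·x by least squares:

```python
import random

# Hypothetical data for y = w*x with true weight w = 2.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

def vanilla_gd(data, lr=0.02, epochs=200):
    """VGD: every update uses the gradient over the whole data set."""
    w = 0.0
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def stochastic_gd(data, lr=0.02, epochs=200, seed=0):
    """SGD: each update uses one randomly sampled example."""
    w = 0.0
    rng = random.Random(seed)
    for _ in range(epochs):
        x, y = rng.choice(data)
        w -= lr * 2 * (w * x - y) * x
    return w
```

VGD computes one averaged gradient per pass over the whole data set, while SGD updates after each randomly sampled example, which is what speeds up learning in practice.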
Figure 4.4. A sketch of cost function [125].
The method of gradient descent comprises calculating the derivative of the squared error function with respect to the network weights, which is achieved using backpropagation. Considering an output neuron, the squared error function is expressed as:

$E = \tfrac{1}{2}\,(o_t - y_M)^2$    (4.14)
where $E$ and $o_t$ are the squared error and the expected output respectively, $y_M$ is the actual output, and the factor ½ cancels the exponent during differentiation. For each neuron $k$, the output $o_k$ is defined as:

$o_k = \varphi(\mathrm{net}_k) = \varphi\!\left(\sum_{i=1}^{n} w_{ik}\, o_i\right)$    (4.15)

The input $\mathrm{net}_k$ is the weighted sum of the outputs $o_i$ of the preceding neurons; if neuron $k$ is in the first hidden layer, then the $o_i$ are simply the inputs $x_i$ to the network. Here $n$ is the number of input units of the neuron, $w_{ik}$ is the weight between neurons $i$ and $k$, and $\varphi$ is the activation function, which must be non-linear and differentiable, such as the logistic function [126]:
$\varphi(z) = \dfrac{1}{1 + e^{-z}}$    (4.16)

The partial derivative of the error $E$ with respect to the weight $w_{ik}$ between neurons is calculated by applying the chain rule twice:
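The logistic function of equation (4.16) has the convenient closed-form derivative φ'(z) = φ(z)(1 − φ(z)), the property that the derivation below relies on. A small sketch (function names are illustrative):

```python
import math

def logistic(z):
    """Eq. (4.16): phi(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_prime(z):
    """Closed-form derivative: phi'(z) = phi(z) * (1 - phi(z))."""
    p = logistic(z)
    return p * (1.0 - p)

# A central finite difference confirms the closed form at z = 1.
h = 1e-6
numeric_slope = (logistic(1.0 + h) - logistic(1.0 - h)) / (2 * h)
```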
$\dfrac{\partial E}{\partial w_{ik}} = \dfrac{\partial E}{\partial o_k}\,\dfrac{\partial o_k}{\partial \mathrm{net}_k}\,\dfrac{\partial \mathrm{net}_k}{\partial w_{ik}}$    (4.17)
Only a single term in the sum $\mathrm{net}_k$ depends on $w_{ik}$, therefore:
$\dfrac{\partial \mathrm{net}_k}{\partial w_{ik}} = \dfrac{\partial}{\partial w_{ik}}\left(\sum_{j=1}^{n} w_{jk}\, o_j\right) = o_i$    (4.18)

with $o_i = x_i$ for first-layer neurons.
Assuming the use of a logistic function, the derivative of the output of neuron $k$ with respect to its input is:

$\dfrac{\partial o_k}{\partial \mathrm{net}_k} = \dfrac{\partial \varphi(\mathrm{net}_k)}{\partial \mathrm{net}_k} = \varphi(\mathrm{net}_k)\,\bigl(1 - \varphi(\mathrm{net}_k)\bigr)$    (4.19)
This is why backpropagation requires the activation function to be differentiable. For an output-layer neuron, the evaluation is straightforward:
$o_k = y_M \qquad\text{and}\qquad \dfrac{\partial E}{\partial o_k} = \dfrac{\partial E}{\partial y_M} = \dfrac{\partial}{\partial y_M}\,\tfrac{1}{2}\,(o_t - y_M)^2 = y_M - o_t$    (4.20)
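Equation (4.20) can be checked numerically with a small sketch (the values $o_t = 1.0$ and $y_M = 0.7$ and the function names are hypothetical):

```python
def squared_error(o_t, y_M):
    """Eq. (4.14): E = 0.5 * (o_t - y_M)^2."""
    return 0.5 * (o_t - y_M) ** 2

def dE_dy(o_t, y_M):
    """Eq. (4.20): derivative of E with respect to the actual output y_M."""
    return y_M - o_t

# Central finite difference around y_M = 0.7 with target o_t = 1.0;
# it should agree with dE_dy(1.0, 0.7) = -0.3 up to O(h^2).
h = 1e-6
numeric = (squared_error(1.0, 0.7 + h) - squared_error(1.0, 0.7 - h)) / (2 * h)
```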
However, if $k$ is an inner-layer neuron of the network, finding the derivative of $E$ with respect to $o_k$ is less obvious. Since $E$ is a function of the inputs of all neurons $N = p, q, \ldots, r$ that receive input from neuron $k$:

$\dfrac{\partial E(o_k)}{\partial o_k} = \dfrac{\partial E(\mathrm{net}_p, \mathrm{net}_q, \ldots, \mathrm{net}_r)}{\partial o_k}$    (4.21)
Taking the total derivative with respect to $o_k$, a recursive expression for the derivative is obtained [118]:
$\dfrac{\partial E}{\partial o_k} = \sum_{i \in N}\left(\dfrac{\partial E}{\partial \mathrm{net}_i}\,\dfrac{\partial \mathrm{net}_i}{\partial o_k}\right) = \sum_{i \in N}\left(\dfrac{\partial E}{\partial o_i}\,\dfrac{\partial o_i}{\partial \mathrm{net}_i}\,w_{ki}\right)$    (4.22)

Thus the derivative with respect to $o_k$ can be calculated if all the derivatives with respect to the outputs $o_i$ of the next layer, the one nearer to the output neuron, are known. Combining these factors gives [124]:

$\dfrac{\partial E}{\partial w_{ik}} = \delta_k\, o_i$    (4.23)

with

$\delta_k = \dfrac{\partial E}{\partial o_k}\,\dfrac{\partial o_k}{\partial \mathrm{net}_k} = \begin{cases} (o_k - o_t)\,o_k\,(1 - o_k) & \text{if } k \text{ is an output neuron} \\ \left(\sum_{i \in N} \delta_i\, w_{ki}\right) o_k\,(1 - o_k) & \text{if } k \text{ is an inner neuron} \end{cases}$
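The complete recursion can be sketched end-to-end for a tiny network: one hidden layer of two logistic neurons and a single logistic output neuron, trained on one pattern (all sizes, initial weights, the training pattern and the learning rate are illustrative assumptions, not from the text):

```python
import math

def phi(z):
    """Logistic activation, Eq. (4.16)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hid, w_out):
    """Forward pass: hidden outputs o_k, then the network output y."""
    o_hid = [phi(sum(wi * xi for wi, xi in zip(w, x))) for w in w_hid]
    y = phi(sum(wi * oi for wi, oi in zip(w_out, o_hid)))
    return o_hid, y

def backprop_step(x, o_t, w_hid, w_out, lr=0.5):
    """One gradient-descent update using the delta rule of Eq. (4.23)."""
    o_hid, y = forward(x, w_hid, w_out)
    # Output-neuron case: delta = (o_k - o_t) * o_k * (1 - o_k)
    delta_out = (y - o_t) * y * (1.0 - y)
    # Inner-neuron case: delta = (sum_i delta_i * w_ki) * o_k * (1 - o_k)
    delta_hid = [delta_out * w_out[k] * o_hid[k] * (1.0 - o_hid[k])
                 for k in range(len(o_hid))]
    # Weight updates: dE/dw_ik = delta_k * o_i
    for k in range(len(w_out)):
        w_out[k] -= lr * delta_out * o_hid[k]
    for k, w in enumerate(w_hid):
        for i in range(len(w)):
            w[i] -= lr * delta_hid[k] * x[i]
    return 0.5 * (o_t - y) ** 2     # squared error before the update

x, o_t = [1.0, 0.5], 0.8
w_hid = [[0.1, -0.2], [0.3, 0.4]]
w_out = [0.2, -0.1]
errors = [backprop_step(x, o_t, w_hid, w_out) for _ in range(500)]
```

The stored squared errors shrink over the iterations as the single pattern is learned.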
However, the gradient descent algorithm with backpropagation does not guarantee reaching the global minimum; it merely guarantees a local minimum, and it also has difficulty crossing plateaus in the landscape of the error function [125]. These non-convergence problems of the error function may limit the performance of gradient descent with backpropagation. In addition, although the method does not require normalization of the input vectors, normalization does improve network performance [127, 128].