

JOURNAL OF SCIENCE & TECHNOLOGY * No. 95 - 2013

THE INFLUENCE OF INITIAL WEIGHTS ON NEURAL NETWORK TRAINING

Cong Nguyen Huu¹, Nga Nguyen Thi Thanh¹, Huy Vu Ngoc¹, Anh Bui Tuan²

¹Thai Nguyen University, ²Thai Nguyen University of Technology
Received March 14, 2013; accepted April 25, 2013

ABSTRACT

A neural network is a soft computing tool applied in many different technological fields, such as automation and control; electronics and telecommunications; information technology... The success of an artificial neural network depends significantly on the network learning process. The result of the learning process is affected by two main factors: the network training algorithm and the initial weights. This paper investigates the influence of different initial weights on neural network training results obtained with a given network training algorithm. The impact becomes particularly clear when the neural network has a special error surface, for instance a cleft error surface. The obtained results contribute to the improvement of learning algorithms for neural networks in general, and especially for neural networks with a special error surface.

Keywords: Cleft-overstep, genetic algorithms, back-propagation algorithm, learning step, initial weights.


1. NEURAL NETWORK AND THE NATURE OF THE NETWORK TRAINING ALGORITHM

In recent years, artificial neural networks have been successfully applied in many different fields: control and automation; electronics and telecommunications; information technology... A lot of algorithms have been developed to train neural networks, but Back Propagation (BP) is used the most [1], [2].

Back propagation with an MLP (Multilayer Perceptron) network is described as follows (a code sketch follows the steps):

- Step 1: Providing a training set of K input sample pairs and target output results.

- Step 2: Creating the initial values for the weights and installing the parameters of the network.

- Step 3: Alternately putting the K samples to the network from the input layer to the output layer. The calculation of the output signal in each layer can be described as follows:

a^0 = p_k (the k-th sample)
a^(m+1) = f^(m+1)(W^(m+1) a^m + b^(m+1)), where m = 0, 1, 2, ..., M-1
a = a^M (the network output)

- Step 4: Calculating the squared average error and propagating this error back to the previous layers.


- Step 5: Updating the weights in the direction of steepest gradient descent. Repeat the process from Step 3 until the value of the mean squared error is acceptably small.
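For concreteness, the following is a minimal sketch of Steps 1-5 for a single-hidden-layer MLP with tanh (tansig) units, a mean-squared-error criterion and a fixed learning step; the function name, the initialization range and the stopping constants are illustrative assumptions, not the authors' implementation:

import numpy as np

def train_bp(P, T, n_hidden=8, lr=0.2, tol=1e-5, max_epochs=20000, seed=0):
    # Steps 1-5 of back-propagation for an MLP with one tanh hidden layer.
    # P: inputs (n_in x K), T: targets (n_out x K).
    rng = np.random.default_rng(seed)
    n_in, n_out, K = P.shape[0], T.shape[0], P.shape[1]
    # Step 2: create initial values for the weights (small random numbers) and biases
    W1 = rng.uniform(-0.5, 0.5, (n_hidden, n_in)); b1 = np.zeros((n_hidden, 1))
    W2 = rng.uniform(-0.5, 0.5, (n_out, n_hidden)); b2 = np.zeros((n_out, 1))
    for epoch in range(1, max_epochs + 1):
        # Step 3: forward pass, a^(m+1) = f^(m+1)(W^(m+1) a^m + b^(m+1))
        a1 = np.tanh(W1 @ P + b1)
        a2 = np.tanh(W2 @ a1 + b2)
        # Step 4: squared average error, propagated back as layer sensitivities
        e = T - a2
        mse = np.mean(e ** 2)
        if mse < tol:                       # training succeeded
            return epoch, mse
        s2 = -2.0 * (1.0 - a2 ** 2) * e     # output-layer sensitivity
        s1 = (1.0 - a1 ** 2) * (W2.T @ s2)  # hidden-layer sensitivity
        # Step 5: update the weights along the steepest-descent direction
        W2 -= lr * (s2 @ a1.T) / K; b2 -= lr * s2.mean(axis=1, keepdims=True)
        W1 -= lr * (s1 @ P.T) / K;  b1 -= lr * s1.mean(axis=1, keepdims=True)
    return max_epochs, mse                  # stopped without reaching the tolerance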

When choosing a neural network structure, note that the number of layers, the number of neurons in each layer, the sample size and the activation function must match the object to be approximated.

The back-propagation technique is combined with the gradient descent algorithm; for a simple function with a single extremum, gradient descent leads to the global extremum. However, the error function E of a multi-layer neural network is a complex surface with many local extrema, so this method cannot guarantee finding the global extremum of the error function.

Two factors strongly affecting the search for the optimal weights of a neural network are the learning step and the initial weight vector. In existing research, the common way to improve the algorithm is to change the learning step in order to overcome local extrema. There is no fixed learning step suitable for all problems; for each specific problem, the learning step is usually chosen experimentally by trial and error. Research in this direction is presented in [3], [4], [5]. This paper instead studies the effect of another factor, the initial weights, on the result of training a neural network with the same training algorithm. This impact becomes clear when the neural network has a special error surface, for example a cleft error surface. The achieved results play an important role in improving neural network learning algorithms in general and neural networks with special error surfaces in particular.

2. INVESTIGATING THE CONVERGENCE OF THE NEURAL NETWORK LEARNING PROCESS BY THE BACK-PROPAGATION TECHNIQUE COMBINED WITH THE GRADIENT DESCENT ALGORITHM, WITH DIFFERENT INITIAL WEIGHTS

Here, the back-propagation technique propagates the error back through the network, and the error function is usually chosen as the squared average (mean squared) error to be minimized. The weights are adjusted in the direction opposite to the gradient vector of the squared average error function. For a multi-layer neural network, the squared average error function is usually complex and has many local extrema. The initial values of the weights strongly influence the final solution. If the weights are initialized with large values, the input signal of each neuron has a large absolute value and the output of the network takes only the two values 0 and 1. This makes the system get stuck in a local minimum or in a flat area near the starting point. The weights are therefore usually initialized with small random numbers [6], [7]. According to the research of Wessels and Barnard [6], the connecting weights w_ij should be initialized in the range [-3/sqrt(k_i), 3/sqrt(k_i)], where k_i is the number of links into neuron i.
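As a small illustration of this initialization rule (assuming k_i is simply the number of links coming into neuron i, i.e. the fan-in; the function name and the layer sizes are illustrative):

import numpy as np

def init_wessels_barnard(n_in, n_out, rng=None):
    # Weight matrix of shape (n_out, n_in); each entry is drawn uniformly from
    # [-3/sqrt(k), 3/sqrt(k)], where k = n_in is the fan-in of every neuron in the layer.
    rng = rng or np.random.default_rng()
    bound = 3.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_out, n_in))

# Example: hidden-layer weights for a layer of 8 neurons fed by 8 inputs
W_hidden = init_wessels_barnard(n_in=8, n_out=8)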

Currently, the Neural Network Toolbox already integrates a number of network training algorithms and different learning steps to choose from; the initial weights for the network learning process are taken randomly within a range.

To see clearly the influence of the initial weight vector on the convergence of the neural network learning process, we consider the following two examples.

2.1. Consider a static nonlinear system to be identified, with the following mathematical model:

y(u) = 0.6sin(pi.u) + 0.3sin(3.pi.u) + 0.1sin(5.pi.u)

We send the signal u(k) = sin(2.pi.k/250) into the above system and measure the output signal y(k). The sample set (u(k), y(k)) is used to train the network.

The neural network used is a three-layer MLP with one input and one output. The input layer has 8 neurons, the hidden layer has 8 neurons, the output layer has 1 neuron, and the activation function of all three layers is tansig. The error tolerance for successful network training is 10^-5. We use the back-propagation technique with a fixed learning step of 0.2.
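A sketch of this experiment is given below; it reuses the train_bp sketch from Section 1 (a single hidden layer for brevity). The data follow the model of Section 2.1, while the number of samples and the seeds that play the role of "different random initial weights" are illustrative assumptions:

import numpy as np

# Training data for the static nonlinear system of Section 2.1
k = np.arange(1, 251)
u = np.sin(2 * np.pi * k / 250)
y = 0.6 * np.sin(np.pi * u) + 0.3 * np.sin(3 * np.pi * u) + 0.1 * np.sin(5 * np.pi * u)
P, T = u.reshape(1, -1), y.reshape(1, -1)

# Same algorithm, structure and learning step in every run; only the random
# initial weights (the seed) change, as in Table 1.
for run in range(1, 15):
    epochs, err = train_bp(P, T, n_hidden=8, lr=0.2, tol=1e-5, seed=run)
    print(f"run {run:2d}: {epochs} training cycles, final error = {err:.4e}")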


Call IW^{1,1} the input layer weight matrix; this matrix has 8 rows and 1 column. Call LW^{2,1} the hidden layer weight matrix; this matrix has 8 rows and 8 columns. Call LW^{3,2} the output layer weight matrix; this matrix has 8 rows and 1 column. Each time the network is trained, the initial weights [IW^{1,1}_0, LW^{2,1}_0, LW^{3,2}_0] are different, i.e., they are selected randomly in a different way. As a result, we obtain different optimal weights, and the numbers of training cycles needed to train the network are also different. Specifically:

Fig. 1. The training epochs of the network


Fig. 2. Simulation of the real output, the target and the error of the network

We obtain the following summary table (Table 1).

Based on Table 1, we see that with the same algorithm, structure and parameters, differing only in the initial weights, the number of training cycles and the final errors are different. This demonstrates that the result of the network learning process depends on the initial weights.

Table 1. Training results of Example 1 with different initial weights

No   Training cycles   Error (10^-6)
1    66                9.8065
2    11                5.8464
3    28                9.8923
4    22                9.4931
5    46                9.9981
6    29                9.9062
7    207               9.5439
8    24                9.968
9    45                9.1781
10   62                9.574
11   55                9.257
12   37                9.684
13   29                7.196
14   60                9.258

2.2. Consider a nonlinear dynamical system to be identified, with the following mathematical model:

dy/dt = 0.00005 - 0.05y - 0.0005u - 0.5uy

We put a random signal with amplitude limited to the range from 0 to 2 L/sec, with a sampling time of 0.1 s, into the system and measure the output signal. This input/output sample set is used to train the network; the total time is 100 s, so 1000 data samples are created in the form of an array.
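A sketch of how such a data set could be generated, assuming the model reads dy/dt = 0.00005 - 0.05y - 0.0005u - 0.5uy and using simple Euler integration; the piecewise-constant shape of the random input is an illustrative assumption:

import numpy as np

rng = np.random.default_rng(0)
dt, t_end = 0.1, 100.0                 # sampling time 0.1 s, total time 100 s
n = int(t_end / dt)                    # 1000 samples

# Random input signal with amplitude limited to the range [0, 2] L/sec
u = rng.uniform(0.0, 2.0, size=n)

y = np.zeros(n)
for j in range(n - 1):
    # Euler step of the assumed model dy/dt = 0.00005 - 0.05*y - 0.0005*u - 0.5*u*y
    dydt = 0.00005 - 0.05 * y[j] - 0.0005 * u[j] - 0.5 * u[j] * y[j]
    y[j + 1] = y[j] + dt * dydt

samples = np.column_stack([u, y])      # the 1000 input/output training pairs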

The neural network structure is chosen as follows:

Fig. 3. Neural network structure for Example 2

The network consists of two layers: the input layer has 4 neurons with the tansig activation function; the output layer has 1 neuron with the purelin activation function.

IW^{1,1} is the input layer weight matrix; the matrix has 1 row and 4 columns.


LW^{2,1} is the hidden layer weight matrix; the matrix has 4 rows and 1 column.

LW^{1,2} is the feedback loop weight matrix from the output back to the input; the matrix has 1 row and 4 columns. We use the back-propagation technique with a fixed learning step of 0.2. The error tolerance to train the network is 10^-11.

We also train the network with different initial weights [IW^{1,1}_0, LW^{2,1}_0, LW^{1,2}_0], i.e., with different random selections. We again obtain different optimal weights, and the training cycles needed to train the network are also different. Specifically:

Table 2. Training results of Example 2 with different initial weights

No   Training cycles   Error (10^-12)
1    210               9.2147
2    151               9.6782
3    234               8.6745
4    193               9.3657
5    271               9.2486
6    146               7.6842
7    231               8.6575
8    301               8.9754
9    229               9.2367
10   234               9.2476
11   167               9.9874
12   205               9.5789
13   212               9.3487
14   203               9.3578

Based on Table 2, we see that with the same algorithm, structure and parameters, differing only in the initial weights, the number of training cycles and the final errors are different. This again demonstrates that the result of the network learning process depends on the initial weights.

3. INVESTIGATING THE CONVERGENCE OF THE NEURAL NETWORK LEARNING PROCESS WITH A SPECIAL ERROR SURFACE BY THE BACK-PROPAGATION TECHNIQUE COMBINED WITH THE OVER-CLEFT ALGORITHM, WITH DIFFERENT INITIAL WEIGHTS

Consider using neural networks to approximate complex nonlinear objects. For these objects, we need to select a multi-layer neural network with the sigmoid activation function to approximate the object. This can lead to an error surface with many local extrema and a cleft form when training the network [3], [4], as shown in Figure 4.


Fig. 4. Cleft error surface

In the previous paper [4], the authors proved that when the Neural Network Toolbox is used to train a neural network with this special error surface, the network converges very slowly or even does not converge. In that paper, the authors proposed the over-cleft algorithm and a method for calculating the over-cleft learning step to update the weights of the neural network. The statistics of the network training results show that the over-cleft learning step is better than others, such as a fixed learning step or a decreasing learning step: the number of failed network training runs and the number of cycles needed to train the network successfully are both reduced. In this paper, still using the back-propagation technique combined with the over-cleft algorithm to train a neural network with a cleft error surface, the authors evaluate the impact of the initial weights on the problem of finding the global optimal extremum.

To illustrate, the authors propose a neural network structure to identify the digits 0, 1, 2, ..., 9, where the sigmoid activation function is used, which generates a cleft error surface. To represent the digits, we use a 5 x 7 = 35 matrix to encode each character. Each input vector x is a vector of size 35 x 1 whose components take the value 0 or 1. Thus, we choose an input layer with 35 neurons. To differentiate the ten characters, the output layer has 10 neurons. For the hidden layer we choose 5 neurons, and we obtain the network structure shown in Figure 5.


Fig. 5. Neural network structure to identify characters

The chosen activation function f is the sigmoid function because, in practice, this function is used for multi-layer neural networks. Moreover, due to its properties, the sigmoid function generates a narrow cleft error surface, so it is used to illustrate the algorithm. The equation of the sigmoid function is f = 1 / (1 + exp(-x)).

The error function used for training the network is J = 0.5(z - t)^2, where z is the output of the output layer and t is the desired target value.

The neural network used to identify the characters consists of three layers; the hidden layer weight matrix has size 35 x 5 and the output layer weight matrix has size 5 x 10.

The initial weights are taken differently and randomly around 0.5, which is the midpoint of the sigmoid activation function. After programming and training the network 14 times, we obtain Table 3.
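A minimal sketch of this setup (a 35-5-10 sigmoid network with the quadratic error above); the digit bitmaps, the helper names and the spread of the initial weights around 0.5 are illustrative assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Each digit is drawn on a 5 x 7 grid and flattened to a 35 x 1 binary vector.
# A complete character set would define all ten bitmaps; only the idea is shown here.
def encode(bitmap_5x7):
    return np.asarray(bitmap_5x7, dtype=float).reshape(35, 1)

n_in, n_hidden, n_out = 35, 5, 10

# Initial weights taken randomly around 0.5 (the midpoint of the sigmoid),
# with an illustrative spread of +/- 0.1
rng = np.random.default_rng(1)
W1 = 0.5 + 0.1 * rng.standard_normal((n_hidden, n_in))   # hidden layer, 5 x 35
W2 = 0.5 + 0.1 * rng.standard_normal((n_out, n_hidden))  # output layer, 10 x 5

def forward(x):
    z1 = sigmoid(W1 @ x)          # hidden-layer output, 5 x 1
    z = sigmoid(W2 @ z1)          # network output, 10 x 1 (one unit per digit)
    return z

def error(z, t):
    # Quadratic error J = 0.5 * (z - t)^2, summed over the 10 output units
    return 0.5 * np.sum((z - t) ** 2)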

Table 3. Training results of the character recognition network with different initial weights

No   Training cycles
1    37
2    Failed
3    42
4    33
5    35
6    28
7    44
8    35
9    29
10   46
11   38
12   39
13   Failed
14   30

Based on Table 3, we see that with the same algorithm, structure and network parameters, the results of the network learning process still depend on the initial weights: 2 out of 14 network training runs failed.

4. PROPOSING A MODEL COMBINING THE GENETIC ALGORITHM WITH THE MLP NEURAL NETWORK LEARNING PROCESS FOR THE SPECIAL ERROR SURFACE

As reviewed in Sections 2 and 3, the specific examples show the importance of the initial weights for the result of the network learning process. The studies [8], [9] reached similar conclusions.

In study [8], the authors propose the use of Cauchy's inequality and a linear algebraic method, while study [9] proposes approximating the signal-flow equations of the network to obtain a linear system of equations with nonnegativity constraints in order to determine optimal initial weights for a feedforward neural network. The research results prove the correctness of these proposals. However, for complex nonlinear systems approximated by neural networks, the generated error surface is a cleft error surface, and there are no effective methods to optimize the initial weights. In this section, the authors present a study to optimize the neural network learning process for this special error surface.

4.1. Comparing the genetic algorithm and the back-propagation algorithm in the problem of optimizing the weights of neural networks

As is known, the error back-propagation algorithm for optimizing the weights of artificial neural networks is widely used today. However, this algorithm works by the mechanism of gradient descent, so it has difficulty finding the global extremum.

Meanwhile, Genetic Algorithms (GA) apply natural evolution to solve optimization problems in practice: from a set of initial solutions, evolutionary steps form new sets with better solutions, eventually finding the optimal solution [10].

Applying the genetic algorithm to the learning process of a neural network involves the following steps (a code sketch follows the steps):


1. Randomly initialize the population P^0 = (a_1^0, a_2^0, ..., a_n^0) of chromosomes.

2. Calculate the adaptive (fitness) value f(a_i) of each chromosome a_i in the current population P^t.

3. Based on the adaptive values, create new chromosomes by selecting parent chromosomes and applying the crossover and mutation operators.

4. Replace chromosomes with weak fitness by new, better chromosomes.

5. Calculate the adaptive values f(a_i) of the new chromosomes and insert them into the new population.

6. Increase the number of generations if the iteration stopping criteria are not met and iterate from step 3. When the stopping criteria are met, the output is the best chromosome.
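A compact sketch of these six steps applied to a vector of network weights; the population size and the crossover and mutation probabilities mirror the comparison parameters given below, while the chromosome encoding, the fitness form 1/(1 + error) and all helper names are illustrative assumptions:

import numpy as np

def ga_optimize(fitness, n_weights, pop_size=20, p_cross=0.46, p_mut=0.1,
                max_gen=20000, target_err=0.1, seed=0):
    # Steps 1-6: evolve a population of weight vectors until the error target is met.
    rng = np.random.default_rng(seed)
    # Step 1: randomly initialize the population P^0 of chromosomes (weight vectors)
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, n_weights))
    for gen in range(1, max_gen + 1):
        # Step 2: adaptive (fitness) value of each chromosome in the current population
        fit = np.array([fitness(w) for w in pop])
        best = pop[np.argmax(fit)].copy()
        if 1.0 / fit.max() - 1.0 < target_err:      # fitness = 1 / (1 + error)
            return best, gen
        # Step 3: create new chromosomes by selection, crossover and mutation
        children = []
        for _ in range(pop_size // 2):
            pa, pb = pop[rng.choice(pop_size, size=2, p=fit / fit.sum())]
            child = np.where(rng.random(n_weights) < p_cross, pa, pb)  # crossover
            mut = rng.random(n_weights) < p_mut
            child[mut] += 0.1 * rng.standard_normal(mut.sum())         # mutation
            children.append(child)
        # Steps 4-5: replace the weakest chromosomes by the new, better ones
        weakest = np.argsort(fit)[: len(children)]
        pop[weakest] = children
        # Step 6: continue with the next generation until a stopping criterion is met
    return best, max_gen

The fitness of a chromosome can be taken, for example, as 1 / (1 + MSE) of the network whose weights the chromosome encodes, so that a smaller network error gives a larger fitness.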

To compare the genetic algorithm and error back-propagation in the problem of finding the optimal solution, we use the character recognition problem of Section 3. The selected parameters are common to both methods:

- Neural network: one hidden layer
- Number of neurons in the hidden layer: 5
- Stopping criterion: error threshold 0.1, or more than 20000 iterations

Parameter of error back-propagation:
- Learning step: 0.2

Parameters of the genetic algorithm:
- Number of populations (population size): 20
- Crossover probability: 0.46
- Mutation probability: 0.1

Here is the statistics table of the iterations needed for network convergence for each option in 16 different trials.

Table 4. Comparing GA and BP with error threshold 0.1
(-): non-convergent network (iterations > 20000)

No    GA     BP
1     1356   3156
2     729    -
3     1042   2578
4     1783   3640
5     1089   2671
6     -      -
7     891    2470
8     902    -
9     728    -
10    865    1890
11    758    2348
12    -      2647
13    968    3378
14    -      2585
15    890    -
16    904    -
Fail  3      6

We found that the genetic algorithm meets the convergence requirement (error < 0.1) more often, meaning that it finds the global extremum more easily than the error back-propagation algorithm does. On the other hand, the error back-propagation algorithm falls into areas containing local extrema more easily than GA does. In 16 trials, GA failed to find the global extremum only 3 times, while BP failed 6 times.

Using the same problem, we changed the error threshold to 0.001 and obtained the following table:

Table 5. Comparing GA and BP with error 0.001

No    GA     BP
1     -      8019
2     -      9190
3     2371   10923
4     -      -
5     -      9801
6     -      -
7     -      -
8     -      -
9     2038   7781
10    3012   8601
11    -      -
12    -      3378
13    -      9021
14    -      -
15    -      -
16    -      10914
Fail  13     7

From this result it can be seen that GA reaches the desired error value in only very few cases. Combining the results of Table 4 and Table 5, we obtain a comparison of the convergence ability of the neural network when the error threshold for stopping the iteration is changed.

Table 6. Comparing GA and BP with different error thresholds

Error to stop iteration   Number of convergent runs in 16 trials
                          GA        BP
0.1                       13        10
0.001                     3         9


From Table 6, we have the following comments: although GA is capable of reaching the area of the global extremum during the search, the combination of random factors generally makes the search very slow. Moreover, it may not reach the global extremum itself but only its neighborhood. In contrast to GA, the error back-propagation algorithm (BP) can reach the extremum if the starting point of the search lies in the area of the global extremum.

5. COMMENTS

In this paper, the authors analyzed the effect of the initial weight vector on the learning process. The effects were evaluated in three typical examples of approximating different systems: a static nonlinearity, a dynamic nonlinearity and a special nonlinearity.

Based on the numerical experiment approach, it is shown that, for common error surfaces, the initialization of random initial weights within a certain range affects only the duration of network training. In cases of special error surfaces containing many extrema and cleft bottoms, it can also make the network learning process fail, because the process gets stuck in a local extremum. This is because the neural network learning process generates different error surfaces when different systems are approximated. With complex error surfaces having many local extrema and cleft bottoms, since the nature of the error back-propagation algorithm is gradient descent, initializing the weights with small random values will cause the network to converge to different minimum values. The neural network can become trapped in a local extremum or stuck in a certain cleft and cannot get out of it. This leads to network training failure, because the training starts from an area which does not contain the global extremum. The authors have therefore proposed combining the locally characterized learning of the ANN with a globally characterized algorithm, such as the genetic algorithm. GA localizes the area containing the global extremum of the error function; then BP, starting from those initial weights, moves to the global extremum. This is an important conclusion which can be treated as a prerequisite for a specific proposed algorithm combining the two approaches. Such a combination can improve the accuracy and convergence speed of the neural network learning process for special cleft error surfaces as well as other types of error surfaces. A sketch of this hybrid scheme is given below.
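A minimal sketch of the proposed combination, reusing the ga_optimize sketch from Section 4.1; the packing of the MLP weights into a single chromosome, the tanh activations and the two error thresholds are illustrative assumptions, not the authors' implementation:

import numpy as np

def hybrid_ga_bp(P, T, n_hidden=5, coarse_err=0.1, fine_tol=1e-3,
                 lr=0.2, max_epochs=20000, seed=0):
    # Stage 1: GA localizes the area containing the global extremum (coarse error target).
    # Stage 2: BP starts from the best GA chromosome and descends to the extremum.
    n_in, n_out = P.shape[0], T.shape[0]
    n_w = n_hidden * n_in + n_out * n_hidden           # all weights packed into one vector

    def unpack(w):
        W1 = w[: n_hidden * n_in].reshape(n_hidden, n_in)
        W2 = w[n_hidden * n_in:].reshape(n_out, n_hidden)
        return W1.copy(), W2.copy()

    def mse(w):
        W1, W2 = unpack(w)
        return np.mean((T - np.tanh(W2 @ np.tanh(W1 @ P))) ** 2)

    # Stage 1: the genetic algorithm finds a chromosome lying near the global extremum
    best_w, _ = ga_optimize(lambda w: 1.0 / (1.0 + mse(w)), n_w,
                            target_err=coarse_err, seed=seed)
    W1, W2 = unpack(best_w)

    # Stage 2: gradient-descent back-propagation refines that starting point
    K = P.shape[1]
    for epoch in range(1, max_epochs + 1):
        a1 = np.tanh(W1 @ P)
        a2 = np.tanh(W2 @ a1)
        e = T - a2
        if np.mean(e ** 2) < fine_tol:
            break
        s2 = -2.0 * (1.0 - a2 ** 2) * e
        s1 = (1.0 - a1 ** 2) * (W2.T @ s2)
        W2 -= lr * (s2 @ a1.T) / K
        W1 -= lr * (s1 @ P.T) / K
    return W1, W2, epoch, np.mean(e ** 2)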

REFERENCES

1. Steve Lawrence and C. Lee Giles, Overfitting and Neural Networks: Conjugate Gradient and Backpropagation, International Joint Conference on Neural Networks, Como, Italy, July 24-27, 114-119, (2000).

2. D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning internal representations by error propagation, in Rumelhart, D.E. et al. (eds.): Parallel Distributed Processing: Explorations in the Microstructure of Cognition (Cambridge, MA: MIT Press), 318-362, (1986).

3. Nguyen Van Manh and Bui Minh Tri, "Method of 'cleft-overstep' by perpendicular direction for solving the unconstrained nonlinear optimization problem", Acta Mathematica Vietnamica, vol. 15, No. 2, (1990).

4. Cong Nguyen Huu, Nga Nguyen Thi Thanh, Ngoc Van Dong, "Research to improve a learning algorithm of neural networks", Tap chi Khoa hoc Cong nghe - Dai hoc Thai Nguyen (Journal of Science and Technology - Thai Nguyen University).

5. J. Hertz, A. Krogh, and R.G. Palmer, Introduction to the Theory of Neural Computation, New York: Addison-Wesley, (1991).

6. L. Wessels and E. Barnard, Avoiding False Local Minima by Proper Initialization of Connections, IEEE Trans. on Neural Networks, (1992).

7. A. J. Al-Shareef and M. F. Abbod, "Neural networks initial weights optimisation," in Proceedings of the 12th International Conference on Modelling and Simulation (UKSim '10), pp. 57-61, (2010).

8. Jim Y. F. Yam and Tommy W. S. Chow, A weight initialization method for improving training speed in feedforward neural network, Neurocomputing 30 (2000), 219-232.

9. Stelios Timotheou, A novel weights initialization method for the random neural network, Neurocomputing 73 (2009), 160-168.

10. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, (1989).

Author's address: Nguyen Huu Cong - Tel: 0913589758, e-mail: huucong@tnut.edu.vn
Thai Nguyen University, Tan Thinh Ward, Thai Nguyen City
