Analysis of iterative learning algorithms for the multilayer perceptron neural network.
Bacek, Tomislav; Majetic, Dubravko; Brezak, Danko et al.
Abstract: In this paper, a comparison of different algorithms used in the training of a multilayer feedforward neural network (MLP) is presented. The tested algorithms, which are of the first or the second order, include both local and global adaptation techniques. Prediction of the nonlinear dynamic Glass-Mackey system is used as a benchmark problem. To improve training speed and efficiency, a bipolar sigmoidal activation function with an adaptive gain parameter is used. Furthermore, a modification of random weight initialization is proposed.
Key words: static neural network, adaptive activation function,
prediction, nonlinear chaotic system
1. INTRODUCTION
Neural networks (NN) are used in a wide variety of applications due to their capacity to learn and to generalize. They are of especially great interest in areas where problems cannot be solved by conventional methods, such as dynamic system prediction. For this reason, these types of problems can be used as benchmark tests. Several experiments have so far shown that neural networks can successfully deal with nonlinear systems (Novakovic et al., 1998). A neural network can be thought of as a mapping function, which is essentially what the solution of a prediction problem requires.
The most widely used algorithm for training MLP networks is Error Back Propagation (EBP). In order to enhance the training capability of the basic EBP algorithm, several modifications are included: momentum, an activation function (AF) with an adaptive parameter, and a modified weight initialization method. Yet, none of these modifications overcomes the main limitation of the EBP method, namely its dependence on the size of the partial derivative. Therefore, in this paper, the EBP method is compared with several major training algorithms.
Besides the aforementioned modified EBP algorithm, three other frequently used algorithms, namely Conjugate Gradient (CG), Resilient Backpropagation (RPROP) and Levenberg-Marquardt (LM), are also tested and compared. Since several versions of the CG and RPROP algorithms are known, two versions of each are tested in this paper. During all tests, the bipolar sigmoidal activation function, the 6-13-1 network structure and the initial weights were kept unchanged.
The main goal of our research is to find the best MLP network learning algorithm for regression and classification problems. A part of this comprehensive research is presented in this paper.
2. FEEDFORWARD NEURAL NETWORK
The neural network used in this paper is a three-layered feedforward NN. The input of the i-th neuron of the k-th layer (with the exception of the input layer) is the sum of the weighted outputs of the neurons of the (k-1)-th layer, Eq. (1). The bias is the only neuron that has no input, because its output is always one. It controls the shape, orientation and steepness of the sigmoidal activation function (AF), and therefore needs to be included.
The task of the neural network in this study was to predict the value of a single future point using the present value and the four past delayed values. Therefore, the input layer has 5 neurons (plus bias), while the output layer has one neuron. The hidden layer can have an arbitrary number of neurons; 12 neurons (plus bias) are chosen in this paper. The sigmoidal AF of the hidden-layer neurons is given in Eq. (2), whereas the AF of the output-layer neuron is a simple linear function with unit gain.
\mathrm{net}_i^{(k)} = \sum_{j} w_{ij}\, y_j^{(k-1)}, (1)

y_i^{(k)} = f\big(\mathrm{net}_i^{(k)}\big) = \frac{2}{1 + e^{-c\,\mathrm{net}_i^{(k)}}} - 1, (2)
where y and net denote the neuron output and input, respectively, and c is the AF's adaptive gain parameter.
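For illustration, a minimal NumPy sketch of the forward pass implied by Eqs. (1) and (2) is given below. The array shapes, variable names and the choice of Python are illustrative assumptions, not taken from the original implementation.

```python
import numpy as np

def bipolar_sigmoid(net, c):
    """Bipolar sigmoidal AF of Eq. (2): 2 / (1 + exp(-c * net)) - 1."""
    return 2.0 / (1.0 + np.exp(-c * net)) - 1.0

def forward(x, W1, W2, c):
    """One forward pass through the 6-13-1 network described above.

    x  : 5 inputs (the present value and 4 delayed values)
    W1 : (12, 6) hidden-layer weights, the last column acting on the bias input
    W2 : length-13 output-layer weight vector, the last entry acting on the hidden bias
    c  : adaptive gain of the hidden-layer activation function
    """
    x_b = np.append(x, 1.0)              # bias neuron: output fixed to one
    net_h = W1 @ x_b                     # Eq. (1): weighted sum over the previous layer
    y_h = bipolar_sigmoid(net_h, c)      # Eq. (2): bipolar sigmoid with gain c
    y_h_b = np.append(y_h, 1.0)          # hidden-layer bias
    return float(W2 @ y_h_b)             # linear output neuron with unit gain
```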
In order to improve learning, a modification of random weight initialization is proposed (Nguyen & Widrow, 1990),

W = 0.7\, H^{1/L} \left(-1 + 2\,\mathrm{rand}\right), (3)

where H and L denote the numbers of neurons in the layers connected by the weight vector W, the former referring to the succeeding and the latter to the preceding layer.
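A possible implementation of the initialization rule (3) is sketched below; the function and argument names, the matrix shape and the random-number handling are illustrative assumptions.

```python
import numpy as np

def init_weights(n_prec, n_succ, rng=None):
    """Random initialization of Eq. (3): W = 0.7 * H**(1/L) * (-1 + 2*rand),
    with H the succeeding-layer and L the preceding-layer neuron count."""
    rng = np.random.default_rng() if rng is None else rng
    scale = 0.7 * n_succ ** (1.0 / n_prec)
    return scale * (-1.0 + 2.0 * rng.random((n_succ, n_prec)))

# e.g. hidden-layer weights of the 6-13-1 network: 6 preceding (5 inputs + bias), 12 succeeding
W1 = init_weights(n_prec=6, n_succ=12)
```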
3. LEARNING ALGORITHMS
As mentioned before, four learning algorithms are tested and compared. The basic EBP algorithm (Novakovic et al., 1998) converges slowly for a small learning parameter η and can lead to oscillations for a large η. Hence, the basic EBP is modified with both a 1st-order (α) and a 2nd-order (β) momentum term, the latter being set to (α-1)/3. The RPROP algorithm (Igel & Husken, 2000) has so far been presented in four versions. Since the modified versions proved to outperform the basic ones, they are used in this paper; they are referred to as iRPROP+ and iRPROP-. The CG algorithm (Kasac et al., 2009) tested in this paper uses the Fletcher-Reeves (FR) or the Dai-Yuan (DY) method for finding the parameter β, because the FR method is the most widely used, and the DY method has been shown to achieve the same level of accuracy as the FR method with a substantial reduction of the computational time (Kasac et al., 2009). Finally, the fourth analysed algorithm is the LM algorithm (Hagan & Menhaj, 1994). In order to have more influence on the network behavior, the coefficient β is split into β_dec and β_inc; the former multiplies the parameter μ when the error decreases, while the latter is used when the error increases.
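To make the role of β_dec and β_inc concrete, the following sketch shows a generic LM weight update and the damping-parameter schedule described above. The default β values and the callable interfaces are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def lm_step(w, jacobian, residuals, mu):
    """One Levenberg-Marquardt update: dw = -(J^T J + mu*I)^(-1) J^T e
    (Hagan & Menhaj, 1994). `jacobian` returns the Jacobian of the residual
    vector e = d - O with respect to the weights; `residuals` returns e."""
    J = jacobian(w)                              # shape (N, n_weights)
    e = residuals(w)                             # shape (N,)
    H = J.T @ J + mu * np.eye(J.shape[1])        # damped Gauss-Newton matrix
    return w - np.linalg.solve(H, J.T @ e)

def adapt_mu(mu, err_new, err_old, beta_dec=0.1, beta_inc=10.0):
    """mu is multiplied by beta_dec when the error decreases and by beta_inc
    when it increases; these beta defaults are illustrative, not the paper's."""
    return mu * (beta_dec if err_new < err_old else beta_inc)
```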
The performance index used in this paper is the sum of squared errors,

E = \sum_{i=1}^{N} (d_i - O_i)^{\mathrm{T}} (d_i - O_i), (4)

where N is the training set size, while d_i and O_i denote the desired and actual network response, respectively. All error measures are reported using the non-dimensional Normalized Root Mean Square error index, NRMS (Lapedes & Farber, 1987).
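The two error measures can be computed, for instance, as follows. The NRMS is taken here as the RMS prediction error normalized by the standard deviation of the desired signal, which is the usual Lapedes & Farber normalization; the exact normalization used in the paper is assumed rather than stated.

```python
import numpy as np

def sse(d, o):
    """Sum of squared errors, Eq. (4), over the whole data set."""
    e = np.asarray(d, dtype=float) - np.asarray(o, dtype=float)
    return float(e @ e)

def nrms(d, o):
    """Normalized root-mean-square error: RMS error divided by the
    standard deviation of the desired signal (assumed normalization)."""
    d = np.asarray(d, dtype=float)
    o = np.asarray(o, dtype=float)
    return float(np.sqrt(np.mean((d - o) ** 2)) / np.std(d))
```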
4. NONLINEAR CHAOTIC SYSTEM
Chaos is a property common to many nonlinear dynamic systems, with a wide variety of nonlinear behaviors, which makes it a great benchmark for testing different signal processing techniques. Since its definition is simple but its behavior hard to predict, the Glass-Mackey chaotic system has been proposed as a NN benchmark (Lapedes & Farber, 1987). The discrete Glass-Mackey dynamic system is defined as (Novakovic et al., 1998)
x(n) = \frac{1}{1+b}\left[ x(n-1) + \frac{a\, x(n-\tau)}{1 + x^{10}(n-\tau)} \right], (5)

where a and b are constants, and τ is the time delay. The sampling time is T_0 = 1 s. In this paper, a = 0.2, b = 0.1 and τ = 30.
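A short sketch of how such a series can be generated from Eq. (5) is given below; the initial value and the flat initial history are assumptions, since the paper does not specify them.

```python
import numpy as np

def glass_mackey(n_points=1000, a=0.2, b=0.1, tau=30, x0=1.2):
    """Generate the discrete Glass-Mackey series of Eq. (5)."""
    x = np.empty(n_points + tau)
    x[:tau] = x0                          # assumed flat initial history
    for n in range(tau, n_points + tau):
        d = x[n - tau]                    # delayed value x(n - tau)
        x[n] = (x[n - 1] + a * d / (1.0 + d ** 10)) / (1.0 + b)
    return x[tau:]                        # drop the artificial initial history

series = glass_mackey()                   # 1000 samples, as used for the benchmark
```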
In order to predict the behavior of the nonlinear chaotic system, i.e. the signal value at the P-th point ahead, a mapping function f(·) needs to be determined from

x(n + P) = f\big(x(n),\, x(n - \Delta),\, \ldots,\, x(n - m\Delta)\big), (6)

where P denotes the number of points ahead, Δ denotes the signal delay, and m is an integer constant. In this paper, P = Δ = 6 and m = 4.
The Glass-Mackey discrete-time series benchmark used in this paper was generated using Eq. (5) and consisted of 1000 points. The first 500 points were used for learning, whereas the remaining 500 points were used for testing the algorithms.
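The construction of input/target pairs according to Eq. (6) can be sketched as follows; the exact handling of the boundary between the learning and test halves of the series is an assumption, as is the stand-in signal used in the usage example.

```python
import numpy as np

def build_dataset(series, P=6, delta=6, m=4):
    """Build pairs according to Eq. (6):
    inputs [x(n), x(n-delta), ..., x(n-m*delta)] -> target x(n+P)."""
    X, t = [], []
    for n in range(m * delta, len(series) - P):
        X.append([series[n - k * delta] for k in range(m + 1)])
        t.append(series[n + P])
    return np.array(X), np.array(t)

series = np.sin(0.1 * np.arange(1000))   # stand-in signal; replace with the Glass-Mackey series
X, t = build_dataset(series)
X_train, t_train = X[:500], t[:500]      # first half for learning (boundary handling assumed)
X_test, t_test = X[500:], t[500:]        # remaining pairs for testing
```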
5. EXPERIMENTAL RESULTS
Every network learning process was carried out over 35000 learning steps. During this process, the network was tested after every 1000 steps, since there is no guarantee that the test error decreases monotonically as learning proceeds. If the test error decreased compared to the previously recorded one, the weights were saved; otherwise, they were discarded. Table 1 shows the learning and test errors for all algorithms, as well as the step at which the smallest test error was recorded. A comparison of the NRMS test error curves for the EBP, iRPROP-, CG DY and LM algorithms is presented in Fig. 1.
[FIGURE 1 OMITTED]
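The testing-and-checkpointing protocol described above can be summarized by the following sketch; the function interfaces are placeholders for the particular learning algorithm and the NRMS evaluation, not the paper's actual code.

```python
import numpy as np

def train_with_checkpoints(step_fn, test_nrms_fn, w0, n_steps=35000, test_every=1000):
    """Run n_steps learning steps, evaluate the test NRMS every test_every steps
    and keep the weights only when the test error improves. step_fn and
    test_nrms_fn stand in for the chosen algorithm's update and the NRMS
    evaluation on the unseen test points; their signatures are illustrative."""
    w, best_w, best_err = w0, w0, np.inf
    for step in range(1, n_steps + 1):
        w = step_fn(w)                      # one learning step (EBP, RPROP, CG or LM)
        if step % test_every == 0:
            err = test_nrms_fn(w)           # test error is not monotone in general,
            if err < best_err:              # so checkpoint only on improvement
                best_err, best_w = err, w
    return best_w, best_err
```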
From the presented results it can be seen that the best results were achieved with the LM algorithm. The problem with the LM algorithm is its time consumption, arising from the computational requirements of each step. Nevertheless, this drawback is offset by the increased efficiency: after only 2000 steps the LM algorithm had already outperformed the best results of all other tested algorithms.
Fig. 2 depicts the best NN test result. It can be seen that the NN learned its prediction task on previously unseen data with high accuracy, which confirms the NN's generalization capabilities.
[FIGURE 2 OMITTED]
6. CONCLUSION
A comparison of different learning algorithms used in the training of a static NN is presented. A modified weight initialization method and an adaptation of the AF gain parameter are included to improve the learning capabilities. Also, modified versions of the simple EBP, RPROP and LM algorithms are tested. For this purpose, prediction of a nonlinear chaotic system is used.
The criteria used for the evaluation of the learning algorithms were the efficiency and the accuracy of the neural network, with the emphasis on accuracy, due to its direct relation to the generalization capability. In our experiments, the Levenberg-Marquardt algorithm proved to be the best algorithm regarding both criteria. Both versions of the RPROP and CG algorithms achieved comparable results, whereas EBP turned out to be the algorithm with the poorest learning and especially generalization capabilities.
Future work will be directed towards the analysis of the presented algorithms and their modifications on different regression and classification benchmark problems, and will be reported in our upcoming publications.
7. REFERENCES
Hagan, M.T. & Menhaj, M.B. (1994). Training Feedforward Networks with the Marquardt Algorithm, IEEE Transactions on Neural Networks, Vol. 5, No. 6, pp. 989-993, ISSN 1045-9227, November 1994
Igel, C. & Husken, M. (2000). Improving the Rprop Learning Algorithm, Proceedings of the Second International Symposium on Neural Computation, NC'2000, Bothe, H. & Rojas, R. (Eds.), pp. 115-121, ICSC Academic Press, 2000
Kasac, J.; Deur, J.; Novakovic, B. & Kolmanovsky, I.V. (2009). A Conjugate Gradient-based BPTT-like Optimal Control Algorithm, 3rd IEEE Multi-conference on Systems and Control, ISSN 1085-1992, Saint Petersburg, 2009
Lapedes, A. & Farber, R. (1987). Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling, Technical Report, Los Alamos National Laboratory, Los Alamos, New Mexico, 1987
Nguyen, D. & Widrow, B. (1990). Improving the Learning Speed of 2-Layer Neural Networks by Choosing Initial Values of the Adaptive Weights, Proceedings of the International Joint Conference on Neural Networks (IJCNN), Vol. 3, pp. 21-26, San Diego, CA, USA, 1990
Novakovic, B.; Majetic, D. & Siroki, M. (1998). Artificial Neural Networks, Faculty of Mechanical Engineering and Naval Architecture, ISBN 953-6313-17-0, Zagreb, Croatia
Tab. 1. Experimental results of a feedforward NN six-step-ahead prediction

Algorithm   NRMS_learning   NRMS_test   NRMS_test step
EBP         0.0662          0.0936      34000
iRPROP-     0.0644          0.0834      16000
iRPROP+     0.0635          0.0834      19000
CG FR       0.0621          0.0831      12000
CG DY       0.0532          0.0823      12000
LM          0.0379          0.0745      6000