Application of reinforcement learning to a two DOF robot arm control.
Albers, Albert; Yan, Wenjie; Frietsch, Markus et al.
1. INTRODUCTION
One of the biggest challenges in current robotics research is that robots "leave" their well-structured environments and are confronted with new tasks in more complex surroundings. A robot can therefore only be successful, and useful, if it is able to adapt itself and learn from experience. Reinforcement Learning (RL), a branch of machine learning (Mitchell, 1997), is one possible solution. RL is a learning process that uses reward and punishment obtained from interaction with the environment to learn a policy for achieving a task. Various RL methods, e.g. Q-learning (Watkins, 1989), have been studied over recent decades, and two problems have become apparent. The first is the high computational effort: RL suffers from the "curse of dimensionality" (Bellman, 1957), which refers to the tendency of a state space to grow exponentially in its dimension, that is, in the number of state variables (Sutton & Barto, 1998). The second is that a Q-table is created for one specific task; storing policies for all possible tasks would require an extremely large amount of memory, which strongly restricts the practical application of this learning method.
In (Martin & De Lope, 2007), the authors present a distributed RL architecture that addresses the first problem by using several small, low-dimensional Q-tables instead of one global high-dimensional Q-table holding the evaluations of the actions for all states. In this paper, further approaches based on an optimized state-space representation are proposed to improve the learning ability of RL.
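As background for the following sections, the tabular Q-learning update referred to above can be sketched as follows; this is a generic textbook sketch, not the concrete implementation of this work, and the state/action counts and parameter values are placeholders:

import numpy as np

# Generic tabular Q-learning sketch (illustrative; sizes and parameters are placeholders).
n_states, n_actions = 70, 3          # e.g. a small discretized state/action space
Q = np.zeros((n_states, n_actions))  # Q-table: estimated return of action a in state s
alpha, gamma, epsilon = 0.1, 0.95, 0.1
rng = np.random.default_rng()

def select_action(state):
    # epsilon-greedy exploration over the discrete action set
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[state]))

def q_update(state, action, reward, next_state):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])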
2. APPROACH
2.1 Overview
The schematic of the robot arm with two degrees of freedom is illustrated in Fig. 1. The system is dynamic and nonlinear because the inertia of the upper arm changes with the variable angle $\theta_2$ (Denzinger & Laureyns, 2008).
[FIGURE 1 OMITTED]
Owing to the redundancy of the inverse kinematics, the angle set closer to the current state is selected as the end position. Since our work focuses on how to reach the goal, the motion accuracy of the intermediate states is of no interest; the interval of a state close to the target can therefore be much smaller than that of a state far from it. A relative position, whose values change with the target position, is adopted to normalize the distribution of the state space. Moreover, a fuzzy-logic system is integrated into the reward function to evaluate the executed actions, and a coordinate transformation is applied so that different tasks can be represented by one Q-table in a library.
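As an illustration of this selection, the following sketch computes both inverse-kinematics solutions of a planar two-link arm and keeps the one closer to the current joint configuration; the link lengths, the distance metric and the interpretation of "closer" are our own assumptions, not taken from the paper:

import numpy as np

def ik_two_link(x, y, l1=1.0, l2=1.0):
    # Both closed-form IK solutions (theta1, theta2) for a planar two-link arm
    # reaching Cartesian point (x, y); link lengths l1, l2 are assumed values.
    c2 = np.clip((x**2 + y**2 - l1**2 - l2**2) / (2.0 * l1 * l2), -1.0, 1.0)
    solutions = []
    for theta2 in (np.arccos(c2), -np.arccos(c2)):   # elbow-down / elbow-up
        theta1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(theta2),
                                               l1 + l2 * np.cos(theta2))
        solutions.append((theta1, theta2))
    return solutions

def select_end_position(x, y, current):
    # Keep the solution with the smaller joint-space distance to the current angles.
    return min(ik_two_link(x, y),
               key=lambda s: np.hypot(s[0] - current[0], s[1] - current[1]))

# Example: target at (1.2, 0.8) with the arm currently stretched out at (0, 0).
theta_target = select_end_position(1.2, 0.8, current=(0.0, 0.0))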
2.2 Uneven distribution of the state space
The state space of the robot arm consists of its joint angles and angular velocities. A Q-table in this case is a 6-dimensional hyperspace, where $\theta$ denotes an angle, $\dot{\theta}$ a velocity and $a$ an action:
$Q = \{\theta_1, \dot{\theta}_1, \theta_2, \dot{\theta}_2, a_1, a_2\}$ (1)
According to the approach of (Martin & De Lope, 2007), we can decompose the Q-table into $Q_1 = \{\theta_1, \dot{\theta}_1, a_1\}$ and $Q_2 = \{\theta_2, \dot{\theta}_2, a_2\}$. With the help of the fixed target position, each angle can be expressed as a constant value $\theta_t$ plus a difference $\Delta\theta$, as shown in the following equations:
$\theta_1 = \theta_{t,1} + \Delta\theta_1, \qquad \theta_2 = \theta_{t,2} + \Delta\theta_2$ (2)
If the state space is built not according to $\theta$ but to $\Delta\theta$, the target position is always located at $(\Delta\theta_1, \Delta\theta_2) = (0, 0)$. In this case, the state space can easily be divided with different intervals:
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII] (3)
A state space for one joint can thus be created with $5 \times 7 = 35$ elements, and for both joints with $2 \times 5 \times 7 = 70$ elements. In comparison, $(2\pi/0.1) \times 5 \times 2 \approx 628$ elements are needed to generate a conventional state space with fixed increments (we assume a fixed increment of 0.1 rad so as to obtain the same accuracy).
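A minimal sketch of this relative, unevenly spaced discretization is given below; the paper fixes only the number of intervals (seven angle and five velocity intervals per joint), so the concrete bin edges and the action count are hypothetical:

import numpy as np

# Hypothetical uneven bin edges for the relative angle d_theta = theta - theta_t (rad):
# fine resolution near the target (d_theta = 0), coarse resolution far away.
ANGLE_EDGES = np.array([-np.pi, -1.0, -0.3, -0.05, 0.05, 0.3, 1.0, np.pi])  # 7 intervals
VEL_EDGES   = np.array([-2.0, -0.5, -0.1, 0.1, 0.5, 2.0])                   # 5 intervals

def discretize(value, edges):
    # Index of the interval containing value; values outside are clipped to the edge bins.
    return int(np.clip(np.searchsorted(edges, value) - 1, 0, len(edges) - 2))

def joint_state(theta, theta_dot, theta_target):
    d_theta = theta - theta_target     # relative position: the target always lies at 0
    return discretize(d_theta, ANGLE_EDGES), discretize(theta_dot, VEL_EDGES)

# One small Q-table per joint, Q_i = {d_theta_i, d_theta_dot_i, a_i}: 7 x 5 x n_actions.
n_actions = 3                          # assumed action count (e.g. torque -/0/+)
Q1 = np.zeros((7, 5, n_actions))
Q2 = np.zeros((7, 5, n_actions))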
2.3 Q-table library
As mentioned before, a movement based on the relative position $\Delta\theta$ is independent of the absolute position $\theta$. For two movements with the same $\Delta\theta$, the coordinate system can be changed so as to make both of them identical. Because the dynamic model changes with $\cos\theta_2$, it is necessary to index the stored Q-tables by $\theta_2$ or $\cos\theta_2$. Using ten discrete values for every dimension, a cube-shaped library with 1000 elements is created over $\Delta\theta_1$, $\Delta\theta_2$ and $\theta_2$. For a new task, the library is searched with these three values and the retrieved Q-table is loaded as the initial policy for reinforcement learning.
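A possible sketch of such a library is given below: the Q-tables of Sec. 2.2 are stored in a 10 x 10 x 10 grid indexed by discretized values of $\Delta\theta_1$, $\Delta\theta_2$ and $\theta_2$, and a new task retrieves the entry of its cell as initial policy. The value ranges of the three dimensions and the dictionary-based storage are our own assumptions:

import numpy as np

N = 10                                 # ten discrete values per library dimension
Q_SHAPE = (7, 5, 3)                    # per-joint Q-table shape from Sec. 2.2 (action count assumed)

def library_index(d_theta1, d_theta2, theta2):
    # Map (d_theta1, d_theta2, theta2) onto a cell of the 10 x 10 x 10 library;
    # the value ranges (+/- pi for every dimension) are assumed.
    def idx(value, lo=-np.pi, hi=np.pi):
        return int(np.clip((value - lo) / (hi - lo) * N, 0, N - 1))
    return idx(d_theta1), idx(d_theta2), idx(theta2)

library = {}                           # cell -> learned Q-tables (Q1, Q2)

def store_policy(d_theta1, d_theta2, theta2, Q1, Q2):
    library[library_index(d_theta1, d_theta2, theta2)] = (Q1.copy(), Q2.copy())

def initial_policy(d_theta1, d_theta2, theta2):
    # Return the Q-tables stored for a similar task, or empty tables for an unseen cell.
    return library.get(library_index(d_theta1, d_theta2, theta2),
                       (np.zeros(Q_SHAPE), np.zeros(Q_SHAPE)))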
2.4 Reward function with fuzzy logic system
Fuzzy logic (Michels, 2002) is an extension of Boolean logic to multi-valued logic and can be used to approximate complex functions with simple, well-understood control conditions. Because the reward function is composed of different parameters and is challenging to describe mathematically, a fuzzy-logic system is integrated into the reward function so as to reduce the modelling effort. The angle and velocity of each joint are defined as input fuzzy sets and the reward value as the output. The reward values of the two joints are calculated separately and then summed into a single reward. The fuzzy rules are depicted in Fig. 2 and the reward function of a single joint in Fig. 3.
[FIGURE 2 OMITTED]
[FIGURE 3 OMITTED]
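Since the actual fuzzy sets and rule base are only shown in Fig. 2 and Fig. 3, the following sketch merely illustrates the structure of such a reward function: each joint contributes a fuzzy reward computed from its relative angle and velocity, and the two contributions are summed. The membership functions, the rule outputs and the Sugeno-style weighted average are assumptions for illustration:

import numpy as np

# Illustrative membership functions; the actual fuzzy sets and rules are those of Fig. 2/3.
def near(d_theta): return float(np.clip(1.0 - abs(d_theta) / 0.3, 0.0, 1.0))
def far(d_theta):  return 1.0 - near(d_theta)
def slow(vel):     return float(np.clip(1.0 - abs(vel) / 0.5, 0.0, 1.0))
def fast(vel):     return 1.0 - slow(vel)

def joint_reward(d_theta, vel):
    # Fuzzy reward of a single joint: rule firing strengths (min-conjunction) weight
    # crisp rule outputs, combined by a Sugeno-style weighted average (assumed).
    rules = [
        (min(near(d_theta), slow(vel)),  1.0),   # near the target and slow -> high reward
        (min(near(d_theta), fast(vel)),  0.2),   # near but still fast      -> small reward
        (min(far(d_theta),  slow(vel)), -0.2),   # far and slow             -> small punishment
        (min(far(d_theta),  fast(vel)), -1.0),   # far and fast             -> punishment
    ]
    total_weight = sum(weight for weight, _ in rules)
    return sum(weight * out for weight, out in rules) / total_weight if total_weight else 0.0

def total_reward(d_theta1, vel1, d_theta2, vel2):
    # The rewards of both joints are calculated separately and then summed.
    return joint_reward(d_theta1, vel1) + joint_reward(d_theta2, vel2)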
3. EXPERIMENTAL RESULTS
The experiments confirm the effectiveness of our approaches. The figures show the results of the performed simulations:
* State space with uneven and with even distribution of the angle positions (increment of 0.1 rad) (Fig. 4)
* Learning with and without the use of a previously learned policy (Fig. 5)
[FIGURE 4 OMITTED]
[FIGURE 5 OMITTED]
4. CONCLUSION
This paper introduced novel approaches for reducing the computational effort of reinforcement learning by means of an unevenly distributed state space and a policy library that accelerates the learning process for new tasks. The experimental results showed that the learning performance can be improved significantly with this new approach. At the same time, the quality of the calculated solutions is better than that of the solutions obtained with conventional methods. For these reasons, we conclude that our approach is a suitable technique for enhancing the learning ability of reinforcement learning for the given application. However, the method can currently be used only for point-to-point motion control and remains difficult to apply to online learning. The presented methods are currently being implemented on a robot arm with more degrees of freedom.
5. REFERENCES
Bellman, R. E. (1957). Dynamic Programming, ISBN: 978-0691079516,
Princeton University Press, Princeton, NJ
Denzinger, J. & Laureyns, I. (2008). A study of reward
functions in reinforcement learning on a dynamic model of a two-link
planar robot, Proceedings of DAAAM 2008, ISBN: 978-3-901509-68-1,
Vienna, Austria, Oct 2008
Martin H., J. A. & De Lope, J. (2007). A Distributed Reinforcement Learning Architecture for Multi-Link Robots, Proceedings of ICINCO 2007, Angers, France, May 2007
Michels, K. (2002). Fuzzy-Regelung: Grundlagen, Entwurf, Analyse, Springer, ISBN: 3-540-43548-4, Berlin, Heidelberg
Mitchell, T. M. (1997). Machine Learning, McGraw-Hill, ISBN: 0-07-042807-7, New York
Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An
Introduction, MIT Press, ISBN: 978-0-262-19398-6, Cambridge