A study of reward functions in reinforcement learning on a dynamic model of a two-link planar robot.
Denzinger, Joachim; Laureyns, Isabelle; Frietsch, Markus et al.
1. INTRODUCTION
One of the many approaches to successfully controlling robot
manipulators is Reinforcement Learning (Sutton & Barto, 1998).
Reinforcement Learning is based on an interaction between an agent
(the decision-maker) and an environment. The agent chooses an action, which
is performed by the manipulator within the environment. Each action
results in a reward, and by mapping rewards to states and actions it
is possible to supply the agent with enough information to deal
successfully with a system that is unknown at the beginning of the
learning process. This is a major advantage of Reinforcement Learning:
because the agent seeks high rewards, it is capable of finding good
solutions without being told how the goal can be achieved. It is
the job of the agent to find a policy \pi, a mapping from states to
actions, that maximizes the reward in the long run and thereby reaches
the goal.
In the work presented in this paper we use two Reinforcement Learning
methods, Q-Learning (Watkins & Dayan, 1992) and SARSA (Rummery &
Niranjan, 1994), to update the Q-table and thereby learn the optimal
policy for executing point-to-point movements of a two-link planar
robot. The Q-table stores the expected return (the Q-value
Q^\pi(s,a)) obtained when the agent takes action a in state s and
thereafter follows the policy \pi.
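As a brief illustration of the two tabular update rules, a minimal sketch is given below. The names used here (Q for the Q-table, alpha for the learning rate, gamma for the discount factor) and the default parameter values are illustrative choices, not settings taken from this work.

import numpy as np

def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    # Off-policy Q-Learning: bootstrap with the greedy action in the next state.
    td_target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy SARSA: bootstrap with the action actually taken in the next state.
    td_target = reward + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

Both rules move Q(s,a) towards a temporal-difference target; they differ only in whether the next action is the greedy one or the one actually selected by the current policy.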
In the case of multi-link robots, the dimensions of the Q-table for
states and actions grow exponentially and result in extremely slow
convergence (a large number of steps). This is known as the "Curse of
Dimensionality" (Bellman, 1954). To address this issue, Martin
& de Lope (2007) proposed a distributed approach to Reinforcement
Learning: instead of dealing with a huge state-action matrix and a
single agent, they use several agents, one per actuator, so that the
table of states and actions is reduced significantly. However, their
approach uses positions for the states (inverse kinematics) without
explicitly considering dynamic aspects such as damping or inertia, and
the effects of different reward functions were not studied.
In this paper we use joint torques as the action vector. This is
motivated by the fact that humans control their joints predominantly
through torques; considering the dynamic model is therefore crucial
when developing control methods for humanoid robots. The two-link
planar robot is a simplification of a human arm.
In this work, we identify the optimal combination of several reward
functions that would fulfill the following requirements:
* A smooth velocity profile while approaching the target position
is necessary to mimic human motion.
* No overshooting after reaching the target.
* Reaching the target in a minimum number of steps.
This paper is organized as follows. Section 2 introduces the
two-link planar robot and the corresponding dynamic model. Section 3
describes the reward functions used, and Section 4 presents the
experimental results.
2. TWO-LINK PLANAR ROBOT
Figure 1 shows the model of the two-link robot used in the current
work. The equations of motion were derived using Lagrange's method.
In the simulation, the shoulder and the elbow joint are each allowed
to move ±\pi rad.
[Figure 1 omitted: model of the two-link planar robot and sample target positions in its workspace]
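The equations of motion are not reproduced in this section. As a hedged sketch, the standard Lagrangian form M(q)\ddot{q} + C(q,\dot{q})\dot{q} + g(q) = \tau of a two-link planar arm is shown below; the link parameters (masses, lengths, centre-of-mass offsets, inertias) are placeholder values, not those of the model used in this work, and gravity can be set to zero if the arm moves in a horizontal plane.

import numpy as np

# Placeholder link parameters (illustrative only).
m1, m2 = 1.0, 1.0        # link masses [kg]
l1, l2 = 0.3, 0.3        # link lengths [m]
lc1, lc2 = 0.15, 0.15    # centre-of-mass offsets [m]
I1, I2 = 0.01, 0.01      # link inertias [kg m^2]
g0 = 9.81                # set to 0.0 for a horizontal working plane

def forward_dynamics(q, dq, tau):
    """Return joint accelerations for joint angles q, velocities dq, torques tau."""
    c2, s2 = np.cos(q[1]), np.sin(q[1])
    # Inertia matrix M(q)
    M = np.array([
        [m1*lc1**2 + m2*(l1**2 + lc2**2 + 2*l1*lc2*c2) + I1 + I2,
         m2*(lc2**2 + l1*lc2*c2) + I2],
        [m2*(lc2**2 + l1*lc2*c2) + I2,
         m2*lc2**2 + I2]])
    # Coriolis/centrifugal matrix C(q, dq)
    h = -m2*l1*lc2*s2
    C = np.array([[h*dq[1], h*(dq[0] + dq[1])],
                  [-h*dq[0], 0.0]])
    # Gravity vector g(q)
    grav = np.array([(m1*lc1 + m2*l1)*g0*np.cos(q[0]) + m2*lc2*g0*np.cos(q[0] + q[1]),
                     m2*lc2*g0*np.cos(q[0] + q[1])])
    return np.linalg.solve(M, tau - C @ dq - grav)

With torques as actions, a forward dynamics of this kind maps each torque command to joint accelerations, which are then integrated to obtain the next state.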
3. REWARD FUNCTIONS
In Reinforcement Learning, the reward used for updating the
Q-table plays an important role in convergence and the learning rate.
This behaviour is independent of the algorithm used (Q-Learning or
SARSA). The reward given to the agent denotes the quality of the state
reached as a result of the actions. All the reward functions presented
below have one thing in common: the better the quality of the state,
the higher the reward. The criteria for evaluating the states are based
on the distance, the velocity, the direction, and an additional penalty.
The velocity-based reward is calculated by also taking the distance from
the target into account. The penalty is applied when a state variable
exceeds its maximal value (the robot moves beyond a prescribed
boundary).
Distance

r_{dst,p} = 1 - d/d_0   (1)

r_{dst,e} = \beta / (1 + (2d/d_0)^n)   (2)

Velocity

r_{vel,p} = (1 - d/d_0)(1 - v/v^*)   (3)

r_{vel,g} = \exp(-\tfrac{1}{2}((d/d_0)/\sigma_d)^2) \, \exp(-\tfrac{1}{2}((v/v^*)/\sigma_v)^2)   (4)

r_{vel,e} = [1/(1 + (d/d_0)^n)] \cdot [1/(1 + (v/v^*)^n)]   (5)

Direction

r_{dir,p} = 2(0.5 - \delta/\pi)   (6)

r_{dir,e} = 1/(1 + (2\delta/\pi)^2)   (7)

Additional penalty

pen = -1/(1 + \exp(50 s_i/s_{i,j} - 40))   (8)
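For illustration only, eqns. (1)-(8) can be transcribed into code as follows. Here d is the current distance to the target, d0 the initial distance, v the speed, v_star the reference speed v^*, and delta the direction error; the shaping parameters beta, n, sigma_d and sigma_v, and the limit value of eq. (8) (called s_i_lim below), are placeholders, since their numerical values are not given in this section.

import numpy as np

def r_dst_p(d, d0):                                        # eq. (1)
    return 1.0 - d / d0

def r_dst_e(d, d0, beta=1.0, n=2):                         # eq. (2)
    return beta / (1.0 + (2.0 * d / d0) ** n)

def r_vel_p(d, d0, v, v_star):                             # eq. (3)
    return (1.0 - d / d0) * (1.0 - v / v_star)

def r_vel_g(d, d0, v, v_star, sigma_d=0.3, sigma_v=0.3):   # eq. (4)
    return (np.exp(-0.5 * ((d / d0) / sigma_d) ** 2)
            * np.exp(-0.5 * ((v / v_star) / sigma_v) ** 2))

def r_vel_e(d, d0, v, v_star, n=2):                        # eq. (5)
    return 1.0 / (1.0 + (d / d0) ** n) / (1.0 + (v / v_star) ** n)

def r_dir_p(delta):                                        # eq. (6)
    return 2.0 * (0.5 - delta / np.pi)

def r_dir_e(delta):                                        # eq. (7)
    return 1.0 / (1.0 + (2.0 * delta / np.pi) ** 2)

def penalty(s_i, s_i_lim):                                 # eq. (8)
    # s_i_lim denotes the limiting value of state variable s_i
    # (written s_{i,j} in the text).
    return -1.0 / (1.0 + np.exp(50.0 * s_i / s_i_lim - 40.0))

These scalars are the partial rewards that are later combined in the weighted sum of eq. (9).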
Each reward function was evaluated individually, and a combination
containing distance, velocity, direction and penalty components was
also evaluated. The composite reward function r is designed as a
weighted sum of five of these components:

r = w_1 r_{vel,cc} + w_2 r_{vel,g} + w_3 r_{dst,e} + w_4 r_{dir,e} + w_5 \, pen   (9)
In order to find a suitable weighting, a fully factorial
experiment was carried out. Each partial reward function is assigned a
weighting between 5% and 45% (w_{i,min} \approx 0.05;
w_{i,max} \approx 0.45; i = 1, 2, ..., 5).
The values of w_1 ... w_4 are chosen independently,
whereas the value of w_5 is calculated by enforcing the constraint
\sum_{i=1}^{5} w_i = 1. The values of w_{i,min} and w_{i,max} are used
to generate uniformly distributed weights for the factorial experiment:
each factor w_i is assigned four values between (and including) the
minimal and maximal values.
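A sketch of how such a grid of candidate weightings could be generated is shown below. Whether combinations whose implied w_5 falls outside the 5%-45% range were discarded in the actual experiment is not stated; the feasibility filter below is an assumption.

import itertools
import numpy as np

levels = np.linspace(0.05, 0.45, 4)    # four uniformly spaced levels per factor

candidates = []
for w1, w2, w3, w4 in itertools.product(levels, repeat=4):
    w5 = 1.0 - (w1 + w2 + w3 + w4)     # enforce sum of the five weights = 1
    if 0.05 <= w5 <= 0.45:             # assumed feasibility filter on w5
        candidates.append((w1, w2, w3, w4, w5))

print(len(candidates), "feasible weightings out of", len(levels) ** 4)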
[Figure 2 omitted: robot trajectories for six target positions using the optimal reward function]
4. EXPERIMENTAL RESULTS
A set of points within the workspace was randomly generated and
provided as target positions for the RL algorithm. Figure 1 shows some
of the randomly generated target positions. The performance (number of
steps) of the RL algorithm in reaching these targets using each of the
reward functions (eqns. 1-7) is shown in Table 1.
Using a four-level, four-factor fully factorial experiment and
subsequent analysis of variance (ANOVA), the best reward function was
obtained as:
r_{opt} = 0.29 r_{vel,cc} + 0.07 r_{vel,g} + 0.34 r_{dst,e} + 0.14 r_{dir,e} + 0.16 \, pen   (10)
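As a direct transcription of eq. (10), the composite reward can be computed from the partial reward values (for example those produced by the sketches in Section 3):

def r_opt(r_vel_cc, r_vel_g, r_dst_e, r_dir_e, pen):
    # Weighted sum of the partial rewards with the weights of eq. (10).
    return (0.29 * r_vel_cc + 0.07 * r_vel_g
            + 0.34 * r_dst_e + 0.14 * r_dir_e + 0.16 * pen)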
The trajectories of the robot for six of the target positions, using
the optimal reward function (eq. 10), are shown in Figure 2. Each target
was reached in a minimum number of steps, without overshoot and with a
smooth velocity profile when approaching the target position (see Fig. 2).
Details of the experiments can be found in Denzinger (2008).
5. CONCLUSION
In this paper, we empirically obtain the optimal composition of a
set of reward functions for a dynamic model of a two-link planar robot,
where torque commands are used as the action vector. The reward
functions were evaluated against the requirements stated in the
introduction, and the evaluation was repeated for several target
positions. Based on this study, we find that these reward functions
depend on the dynamic model of the robot. Future work consists of
identifying reward functions that are independent of the underlying
robot model. Further investigation is also needed to include velocity
profiles within the framework of Reinforcement Learning through
appropriate reward functions; this is especially important for
mimicking human behaviour.
6. REFERENCES
Denzinger, J. (2008). Implementation of distinct Reinforcement
Learning algorithms for the control of a 2-DOF manipulator model.
Diplomarbeit, Institut für Produktentwicklung, Universität Karlsruhe
(TH).
Martin-H., J. A. & de Lope, J. (2007). A Distributed Reinforcement
Learning Architecture for Multi-Link Robots. In Proceedings of ICINCO
2007, pp. 192-197.
Rummery, G. A. & Niranjan, M. (1994). On-line Q-Learning Using
Connectionist Systems. Technical Report CUED/F-INFENG/TR 166,
Engineering Department, Cambridge University.
Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An
Introduction. The MIT Press.
Watkins, C. J. C. H. & Dayan, P. (1992). Technical Note: Q-Learning.
Machine Learning, 8, 279-292.
DENZINGER, J[oachim]; LAUREYNS, I[sabelle]; FRIETSCH, M[arkus];
BURGER, W[olfgang] & MECKL, P[eter] *
* Supervisor, Mentor
Table 1. Number of steps taken to reach each target position using each reward function.

Reward function    Target 1   Target 2   Target 3   Target 4   Target 5   Target 6
r_{dst,p}                 4         38         28         35         35         37
r_{dst,e}                 4         38         28         35         35         37
r_{vel,p}                 4          7        150        150        150        150
r_{vel,g}                 4          8         10          9         25         34
r_{vel,e}                 4         18          9         12         24         37
r_{dir,p}                 4         28         14         19         20         26
r_{opt}                   4          8          8         10         14         16