A study of reward functions in reinforcement learning on a dynamic model of a two-link planar robot.
Denzinger, Joachim; Laureyns, Isabelle; Frietsch, Markus et al.
1. INTRODUCTION
One of the many approaches to successfully controlling robot
manipulators is Reinforcement Learning (Sutton & Barto, 1998).
Reinforcement Learning is based on an interaction between an agent
(the decision-maker) and an environment. The agent chooses an action, which
is performed by the manipulator within the environment. Each action
results in a reward, and by mapping rewards to states and actions it
is possible to supply the agent with enough information to deal
successfully with a system that is unknown at the beginning of the
learning process. This is a major advantage of Reinforcement Learning:
because the agent seeks high rewards, it is capable of finding good
solutions without being told how the goal can be achieved. It is
the job of the agent to find a policy \pi, a mapping from states to
actions, that maximizes the reward in the long run and thereby reaches
the goal.
In the work presented in this paper we use two Reinforcement Learning
methods, Q-Learning (Watkins & Dayan, 1992) and SARSA (Rummery &
Niranjan, 1994), to update the Q-table and thereby learn the optimal
policy for executing point-to-point movements of a two-link planar
robot. The Q-table stores the expected return (the Q-value
Q^\pi(s,a)) obtained when the agent takes action a in state s and
thereafter follows the policy \pi.
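As a brief illustration of the two tabular update rules, a minimal sketch is given below. The names used here (Q for the Q-table, alpha for the learning rate, gamma for the discount factor) and the default parameter values are illustrative choices, not settings taken from this work.

import numpy as np

def q_learning_update(Q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    # Off-policy Q-Learning: bootstrap with the greedy action in the next state.
    td_target = reward + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy SARSA: bootstrap with the action actually taken in the next state.
    td_target = reward + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])

Both rules move Q(s,a) towards a temporal-difference target; they differ only in whether the next action is the greedy one or the one actually selected by the current policy.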
In the case of multi-link robots, the dimensions of the Q-table for
states and actions grow exponentially and result in extremely slow
convergence (a large number of steps). This is known as the "Curse of
Dimensionality" (Bellman, 1954). To address this issue, Martin
& de Lope (2007) proposed a distributed approach to Reinforcement
Learning: instead of dealing with a huge state-action matrix and a
single agent, they use several agents, one per actuator, so that the
table of states and actions is reduced significantly. However, their
approach uses positions for the states (inverse kinematics) without
explicitly considering dynamic aspects such as damping or inertia, and
the effects of different reward functions were not studied.
In this paper we use joint torques as the action vector. This is
motivated by the fact that humans control their joints predominantly
through torques; considering the dynamic model is therefore crucial
when developing control methods for humanoid robots. The two-link
planar robot is a simplification of a human arm.
In this work, we identify the optimal combination of several reward
functions that would fulfill the following requirements:
* A smooth velocity profile while approaching the target position
is necessary to mimic human motion.
* No overshooting after reaching the target.
* Reaching the target in a minimum number of steps.
This paper is organized as follows. Section 2 introduces the
two-link planar robot and the corresponding dynamic model. Section 3
describes the reward functions used, and Section 4 presents the
experimental results.
2. TWO-LINK PLANAR ROBOT
Figure 1 shows the model of the two-link robot used in the current
work. The equations of motion were derived using Lagrange's method.
In the simulation, the shoulder and the elbow joint are each allowed
to move ±\pi rad.
[Figure 1 omitted: model of the two-link planar robot and sample target positions in its workspace]
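The equations of motion are not reproduced in this section. As a hedged sketch, the standard Lagrangian form M(q)\ddot{q} + C(q,\dot{q})\dot{q} + g(q) = \tau of a two-link planar arm is shown below; the link parameters (masses, lengths, centre-of-mass offsets, inertias) are placeholder values, not those of the model used in this work, and gravity can be set to zero if the arm moves in a horizontal plane.

import numpy as np

# Placeholder link parameters (illustrative only).
m1, m2 = 1.0, 1.0        # link masses [kg]
l1, l2 = 0.3, 0.3        # link lengths [m]
lc1, lc2 = 0.15, 0.15    # centre-of-mass offsets [m]
I1, I2 = 0.01, 0.01      # link inertias [kg m^2]
g0 = 9.81                # set to 0.0 for a horizontal working plane

def forward_dynamics(q, dq, tau):
    """Return joint accelerations for joint angles q, velocities dq, torques tau."""
    c2, s2 = np.cos(q[1]), np.sin(q[1])
    # Inertia matrix M(q)
    M = np.array([
        [m1*lc1**2 + m2*(l1**2 + lc2**2 + 2*l1*lc2*c2) + I1 + I2,
         m2*(lc2**2 + l1*lc2*c2) + I2],
        [m2*(lc2**2 + l1*lc2*c2) + I2,
         m2*lc2**2 + I2]])
    # Coriolis/centrifugal matrix C(q, dq)
    h = -m2*l1*lc2*s2
    C = np.array([[h*dq[1], h*(dq[0] + dq[1])],
                  [-h*dq[0], 0.0]])
    # Gravity vector g(q)
    grav = np.array([(m1*lc1 + m2*l1)*g0*np.cos(q[0]) + m2*lc2*g0*np.cos(q[0] + q[1]),
                     m2*lc2*g0*np.cos(q[0] + q[1])])
    return np.linalg.solve(M, tau - C @ dq - grav)

With torques as actions, a forward dynamics of this kind maps each torque command to joint accelerations, which are then integrated to obtain the next state.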
3. REWARD FUNCTIONS
In Reinforcement Learning, the reward used for updating the
Q-table plays an important role in convergence and the learning rate.
This behaviour is independent of the algorithm used (Q-Learning or
SARSA). The reward given to the agent denotes the quality of the state
reached as a result of the actions. All the reward functions presented
below have one thing in common: the better the quality of the state,
the higher the reward. The criteria for evaluating the states are based
on the distance, the velocity, the direction, and an additional penalty.
The velocity-based reward is calculated by also taking the distance from
the target into account. The penalty is applied when a state variable
exceeds its maximal value (the robot moves beyond a prescribed
boundary).
Distance

r_{dst,p} = 1 - d/d_0   (1)

r_{dst,e} = \beta / (1 + (2d/d_0)^n)   (2)

Velocity

r_{vel,p} = (1 - d/d_0)(1 - v/v^*)   (3)

r_{vel,g} = \exp(-\tfrac{1}{2}((d/d_0)/\sigma_d)^2) \, \exp(-\tfrac{1}{2}((v/v^*)/\sigma_v)^2)   (4)

r_{vel,e} = [1/(1 + (d/d_0)^n)] \cdot [1/(1 + (v/v^*)^n)]   (5)

Direction

r_{dir,p} = 2(0.5 - \delta/\pi)   (6)

r_{dir,e} = 1/(1 + (2\delta/\pi)^2)   (7)

Additional penalty

pen = -1/(1 + \exp(50 s_i/s_{i,j} - 40))   (8)
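For illustration only, eqns. (1)-(8) can be transcribed into code as follows. Here d is the current distance to the target, d0 the initial distance, v the speed, v_star the reference speed v^*, and delta the direction error; the shaping parameters beta, n, sigma_d and sigma_v, and the limit value of eq. (8) (called s_i_lim below), are placeholders, since their numerical values are not given in this section.

import numpy as np

def r_dst_p(d, d0):                                        # eq. (1)
    return 1.0 - d / d0

def r_dst_e(d, d0, beta=1.0, n=2):                         # eq. (2)
    return beta / (1.0 + (2.0 * d / d0) ** n)

def r_vel_p(d, d0, v, v_star):                             # eq. (3)
    return (1.0 - d / d0) * (1.0 - v / v_star)

def r_vel_g(d, d0, v, v_star, sigma_d=0.3, sigma_v=0.3):   # eq. (4)
    return (np.exp(-0.5 * ((d / d0) / sigma_d) ** 2)
            * np.exp(-0.5 * ((v / v_star) / sigma_v) ** 2))

def r_vel_e(d, d0, v, v_star, n=2):                        # eq. (5)
    return 1.0 / (1.0 + (d / d0) ** n) / (1.0 + (v / v_star) ** n)

def r_dir_p(delta):                                        # eq. (6)
    return 2.0 * (0.5 - delta / np.pi)

def r_dir_e(delta):                                        # eq. (7)
    return 1.0 / (1.0 + (2.0 * delta / np.pi) ** 2)

def penalty(s_i, s_i_lim):                                 # eq. (8)
    # s_i_lim denotes the limiting value of state variable s_i
    # (written s_{i,j} in the text).
    return -1.0 / (1.0 + np.exp(50.0 * s_i / s_i_lim - 40.0))

These scalars are the partial rewards that are later combined in the weighted sum of eq. (9).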
Each reward function was evaluated individually, and a combination
containing distance, velocity, direction and penalty components was
also evaluated. The composite reward function r is designed as a
weighted sum of five of these components:

r = w_1 r_{vel,cc} + w_2 r_{vel,g} + w_3 r_{dst,e} + w_4 r_{dir,e} + w_5 \, pen   (9)
In order to find a suitable weighting, a fully factorial
experiment was carried out. Each partial reward function is assigned a
weighting between 5% and 45% (w_{i,min} \approx 0.05;
w_{i,max} \approx 0.45; i = 1, 2, ..., 5).
The values of w_1 ... w_4 are chosen independently,
whereas the value of w_5 is calculated by enforcing the constraint
\sum_{i=1}^{5} w_i = 1. The values of w_{i,min} and w_{i,max} are used
to generate uniformly distributed weights for the factorial experiment:
each factor w_i is assigned four values between (and including) the
minimal and maximal values.
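A sketch of how such a grid of candidate weightings could be generated is shown below. Whether combinations whose implied w_5 falls outside the 5%-45% range were discarded in the actual experiment is not stated; the feasibility filter below is an assumption.

import itertools
import numpy as np

levels = np.linspace(0.05, 0.45, 4)    # four uniformly spaced levels per factor

candidates = []
for w1, w2, w3, w4 in itertools.product(levels, repeat=4):
    w5 = 1.0 - (w1 + w2 + w3 + w4)     # enforce sum of the five weights = 1
    if 0.05 <= w5 <= 0.45:             # assumed feasibility filter on w5
        candidates.append((w1, w2, w3, w4, w5))

print(len(candidates), "feasible weightings out of", len(levels) ** 4)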
[Figure 2 omitted: robot trajectories for six target positions using the optimal reward function]
4. EXPERIMENTAL RESULTS
A set of points within the workspace was randomly generated and
provided as target positions for the RL algorithm. Figure 1 shows some
of the randomly generated target positions. The performance (number of
steps) of the RL algorithm in reaching these targets using each of the
reward functions (eqns. 1-7) is shown in Table 1.
Using a four-level, four-factor fully factorial experiment and
subsequent analysis of variance (ANOVA), the best reward function was
obtained as:
r_{opt} = 0.29 r_{vel,cc} + 0.07 r_{vel,g} + 0.34 r_{dst,e} + 0.14 r_{dir,e} + 0.16 \, pen   (10)
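As a direct transcription of eq. (10), the composite reward can be computed from the partial reward values (for example those produced by the sketches in Section 3):

def r_opt(r_vel_cc, r_vel_g, r_dst_e, r_dir_e, pen):
    # Weighted sum of the partial rewards with the weights of eq. (10).
    return (0.29 * r_vel_cc + 0.07 * r_vel_g
            + 0.34 * r_dst_e + 0.14 * r_dir_e + 0.16 * pen)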
The trajectories of the robot for six of the target positions, using
the optimal reward function (eq. 10), are shown in Figure 2. Each target
was reached in a minimum number of steps, without overshoot and with a
smooth velocity profile when approaching the target position (see Fig. 2).
Details of the experiments can be found in Denzinger (2008).
5. CONCLUSION
In this paper, we empirically obtain the optimal composition of a
set of reward functions for a dynamic model of a two-link planar robot,
where torque commands are used as the action vector. The reward
functions were evaluated against the requirements stated in the
introduction, and the evaluation was repeated for several target
positions. Based on this study, we find that these reward functions
depend on the dynamic model of the robot. Future work consists of
identifying reward functions that are independent of the underlying
robot model. Further investigation is also needed to include velocity
profiles within the framework of Reinforcement Learning through
appropriate reward functions; this is especially important for
mimicking human behaviour.
6. REFERENCES
Denzinger, J. (2008). Implementation of distinct Reinforcement
Learning algorithms for the control of a 2-DOF manipulator model.
Diplomarbeit, Institut für Produktentwicklung, Universität Karlsruhe
(TH).
Martin-H., J. A. & de Lope, J. (2007). A Distributed Reinforcement
Learning Architecture for Multi-Link Robots. In Proceedings of ICINCO
2007, pp. 192-197.
Rummery, G. A. & Niranjan, M. (1994). On-line Q-Learning Using
Connectionist Systems. Technical Report CUED/F-INFENG/TR 166,
Engineering Department, Cambridge University.
Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An
Introduction. The MIT Press.
Watkins, C. J. C. H. & Dayan, P. (1992). Technical Note: Q-Learning.
Machine Learning, 8, 279-292.
DENZINGER, J[oachim]; LAUREYNS, I[sabelle]; FRIETSCH, M[arkus];
BURGER, W[olfgang] & MECKL, P[eter] *
* Supervisor, Mentor
Table 1. Number of steps taken to reach each target position using each reward function.

Reward function    Target 1   Target 2   Target 3   Target 4   Target 5   Target 6
r_{dst,p}                 4         38         28         35         35         37
r_{dst,e}                 4         38         28         35         35         37
r_{vel,p}                 4          7        150        150        150        150
r_{vel,g}                 4          8         10          9         25         34
r_{vel,e}                 4         18          9         12         24         37
r_{dir,p}                 4         28         14         19         20         26
r_{opt}                   4          8          8         10         14         16