
Article Information

  • Title: Application of reinforcement learning to a two DOF robot arm control.
  • Authors: Albers, Albert; Yan, Wenjie; Frietsch, Markus
  • Journal: Annals of DAAAM & Proceedings
  • Print ISSN: 1726-9679
  • Year: 2009
  • Issue: January
  • Language: English
  • Publisher: DAAAM International Vienna
  • Abstract: One of the biggest challenges in current robotics research is that robots "leave" their well-structured environments and are confronted with new tasks in more complex surroundings. A robot can therefore only be successful, or indeed useful, if it is able to adapt itself and learn from experience. Reinforcement Learning (RL), a branch of machine learning (Mitchell, 1997), is one possible solution. RL is a learning process that uses reward and punishment from interaction with the environment to learn a policy for achieving tasks. Various RL methods, e.g. Q-learning (Watkins, 1989), have been studied over recent decades, and two problems must be considered. The first is the high computational effort: RL suffers from the "curse of dimensionality" (Bellman, 1957), which refers to the tendency of a state space to grow exponentially in its dimension, that is, in the number of state variables (Sutton & Barto, 1998). Secondly, a Q-table is created for one specific task, so storing policies for all possible tasks requires an extremely large amount of memory, which strongly restricts the practical application of this learning method.
  • Keywords: Artificial intelligence; Degrees of freedom (Mechanics); Reinforcement learning (Machine learning); Robot arms; Robot motion; Robots

Application of reinforcement learning to a two DOF robot arm control.


Albers, Albert; Yan, Wenjie; Frietsch, Markus


1. INTRODUCTION

One of the biggest challenges in current robotics research is that robots "leave" their well-structured environments and are confronted with new tasks in more complex surroundings. A robot can therefore only be successful, or indeed useful, if it is able to adapt itself and learn from experience. Reinforcement Learning (RL), a branch of machine learning (Mitchell, 1997), is one possible solution. RL is a learning process that uses reward and punishment from interaction with the environment to learn a policy for achieving tasks. Various RL methods, e.g. Q-learning (Watkins, 1989), have been studied over recent decades, and two problems must be considered. The first is the high computational effort: RL suffers from the "curse of dimensionality" (Bellman, 1957), which refers to the tendency of a state space to grow exponentially in its dimension, that is, in the number of state variables (Sutton & Barto, 1998). Secondly, a Q-table is created for one specific task, so storing policies for all possible tasks requires an extremely large amount of memory, which strongly restricts the practical application of this learning method.

In (Martin & De Lope, 2007), the authors present a distributed RL architecture to address the first problem: several small, low-dimensional Q-tables are used instead of a single global high-dimensional Q-table that evaluates the actions for all states. In this paper, further approaches based on an optimized state space representation are proposed to improve the learning ability of RL.
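To make the Q-table concept concrete, the following is a minimal sketch of the standard tabular Q-learning update (Watkins, 1989). It is not the implementation used in the paper; the state and action counts, learning rate, discount factor and epsilon-greedy exploration are illustrative assumptions only.

import numpy as np

# Minimal tabular Q-learning sketch; sizes and hyperparameters are illustrative
# assumptions, not values taken from the paper.
n_states, n_actions = 70, 3        # e.g. a small discretized state space
alpha, gamma = 0.1, 0.9            # learning rate and discount factor
q_table = np.zeros((n_states, n_actions))

def epsilon_greedy(s, epsilon=0.1):
    """Choose a random action with probability epsilon, otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(np.argmax(q_table[s]))

def q_update(s, a, reward, s_next):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * np.max(q_table[s_next])
    q_table[s, a] += alpha * (td_target - q_table[s, a])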

2. APPROACH

2.1 Overview

The schematic of the robot arm with two degrees of freedom is illustrated in Fig. 1. The system is dynamic and nonlinear because the inertia of the upper arm changes with the variable angle θ_2 (Denzinger & Laureyns, 2008).

[FIGURE 1 OMITTED]

Due to the redundancy of the inverse kinematics, the angle set that is closer to the target state is selected as the end position. Because our work focuses on how to reach the goal, the motion accuracy of the intermediate states is not of interest; the interval of a state close to the target can therefore be much smaller than that of a state far from it. A relative position, whose values change with the target position, is adopted to normalize the distribution of the state space. Moreover, a fuzzy-logic system is integrated into the reward function to evaluate the executed actions, and a coordinate transformation is applied so that different tasks can be represented by one Q-table in a library.

2.2 Uneven distribution of the state space

The state space of the robot arm consists of its joint angles and angular velocities. A Q-table in this case is a 6-dimensional hyperspace, where θ denotes a joint angle, θ̇ an angular velocity and a an action:

Q = {θ_1, θ̇_1, θ_2, θ̇_2, a_1, a_2}   (1)

Following the approach of (Martin & De Lope, 2007), we can decompose the Q-table into Q_1 = {θ_1, θ̇_1, a_1} and Q_2 = {θ_2, θ̇_2, a_2}. With the help of the fixed target position, each joint angle can be written as a constant value θ_t plus a difference Δθ, as shown in the following equations:

θ_1 = θ_{t,1} + Δθ_1,   θ_2 = θ_{t,2} + Δθ_2   (2)

If the state space is built not over θ but over Δθ, the target position is always located at (Δθ_1, Δθ_2) = (0, 0). In this case, the state space can easily be divided using different interval widths:

[EQUATION (3) OMITTED: definition of the uneven intervals of Δθ and θ̇]

A state space for one joint can thus be created with 5 × 7 = 35 elements and for both joints with 2 × 5 × 7 = 70 elements. In comparison, (2π/0.1) × 5 × 2 ≈ 628 elements are needed to generate a conventional state space with fixed increments (assuming Δθ = 0.1 to obtain the same accuracy near the target).
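As an illustration of this uneven discretization, the following sketch bins the relative angle Δθ of one joint into 7 non-uniform intervals (fine near the target, coarse far away) and the angular velocity into 5 intervals, giving the 35 states per joint mentioned above. The exact interval boundaries correspond to equation (3), which is not reproducible here, so the bin edges below are assumptions.

import numpy as np

# Illustrative uneven discretization of the relative angle dtheta = theta - theta_t.
# The exact boundaries of equation (3) are not available, so the edges below are
# assumptions: finer bins near the target at dtheta = 0, coarser bins far away.
# 7 angle bins x 5 velocity bins gives the 35 states per joint mentioned above.
ANGLE_EDGES = np.array([-np.pi, -1.0, -0.3, -0.05, 0.05, 0.3, 1.0, np.pi])   # 7 bins
VEL_EDGES   = np.array([-np.inf, -1.0, -0.2, 0.2, 1.0, np.inf])              # 5 bins

def wrap_angle(x):
    """Map an angle to the interval [-pi, pi)."""
    return (x + np.pi) % (2.0 * np.pi) - np.pi

def joint_state(theta, theta_target, theta_dot):
    """Discrete state index of one joint, built on the relative angle."""
    d_theta = wrap_angle(theta - theta_target)
    a_bin = int(np.clip(np.digitize(d_theta, ANGLE_EDGES) - 1, 0, 6))
    v_bin = int(np.clip(np.digitize(theta_dot, VEL_EDGES) - 1, 0, 4))
    return a_bin * 5 + v_bin          # one of 35 states per joint

# A uniform grid with 0.1 rad steps would instead need about
# (2 * pi / 0.1) * 5 ≈ 314 states per joint, i.e. ≈ 628 for both joints.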

2.3 Q-table library

As mentioned before, a movement based on the relative position Δθ is independent of the absolute position θ. Two movements with the same Δθ can be made identical by changing the coordinate system. Because the dynamic model changes with cos θ_2, it is necessary to index the stored Q-tables by θ_2 or cos θ_2. Using ten discrete values for every dimension, a cube-shaped library with 1000 elements is created over the different values of Δθ_1, Δθ_2 and θ_2. For a new task, the library is searched by these three values and the retrieved Q-table is loaded as the initial policy for reinforcement learning.
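The sketch below shows one way such a Q-table library could be organized and queried. The ten discrete values per dimension and the lookup keys (Δθ_1, Δθ_2, θ_2) follow the text; the grid range, array shapes, action count and nearest-neighbour lookup are assumptions made for illustration.

import numpy as np

# Sketch of a Q-table library indexed by (dtheta1, dtheta2, theta2); the 10-value
# grid per axis follows the text, everything else is an assumed illustration.
N = 10                                             # 10 discrete values per dimension
GRID = np.linspace(-np.pi, np.pi, N)               # assumed discretization of each axis
N_STATES_PER_JOINT, N_ACTIONS = 35, 3              # 35 states as in Sec. 2.2; action count assumed

# library[i, j, k] holds one learned Q-table pair for the task class
# (dtheta1, dtheta2, theta2) ~ (GRID[i], GRID[j], GRID[k])  ->  10^3 = 1000 entries
library = np.zeros((N, N, N, 2, N_STATES_PER_JOINT, N_ACTIONS))

def nearest(grid, value):
    """Index of the grid point closest to value."""
    return int(np.argmin(np.abs(grid - value)))

def initial_policy(dtheta1, dtheta2, theta2):
    """Load the stored Q-tables of the most similar task as the initial policy."""
    i, j, k = nearest(GRID, dtheta1), nearest(GRID, dtheta2), nearest(GRID, theta2)
    return library[i, j, k].copy()                 # shape: (2 joints, 35 states, 3 actions)

def store_policy(dtheta1, dtheta2, theta2, q_tables):
    """Write the Q-tables learned for a task back into the library."""
    i, j, k = nearest(GRID, dtheta1), nearest(GRID, dtheta2), nearest(GRID, theta2)
    library[i, j, k] = q_tables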

2.4 Reward function with fuzzy logic system

Fuzzy logic (Michels, 2002) is an extension of Boolean logic to multi-valued logic, which can be used to approximate complex functions with simple and well-understood control conditions. Because the reward function is composed of different parameters and is difficult to describe mathematically, a fuzzy logic system is integrated into the reward function to reduce the modeling effort. The angle and velocity of each joint are defined as input fuzzy sets and the reward value as the output. The reward values for the two joints are calculated separately and then summed into one value. The fuzzy rules are depicted in Fig. 2 and the reward function of a single joint in Fig. 3.
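Because the actual rule base and membership functions appear only in the omitted Figs. 2 and 3, the following sketch merely illustrates the mechanism of such a fuzzy reward for one joint: triangular membership functions on the absolute angle error and velocity, a few if-then rules, and weighted-average defuzzification. All concrete numbers and rules are assumptions.

# Hedged sketch of a fuzzy reward for one joint.  The actual rule base is shown
# in Fig. 2 (omitted), so the membership functions, rules and reward levels below
# are assumptions that only illustrate the mechanism.
def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_reward(d_theta, theta_dot):
    """Reward of a single joint from |angle error| and |angular velocity|."""
    e, v = abs(d_theta), abs(theta_dot)
    # Input fuzzy sets (assumed): error/velocity are "small", "medium" or "large".
    e_small, e_med = tri(e, -0.1, 0.0, 0.3), tri(e, 0.1, 0.5, 1.0)
    e_large = min(1.0, max(0.0, (e - 0.5) / 0.5))
    v_small = tri(v, -0.5, 0.0, 1.0)
    v_large = min(1.0, max(0.0, (v - 0.5) / 1.0))
    # Rule base (assumed): near the target and slow -> high reward, far or fast -> low.
    rules = [
        (min(e_small, v_small),  1.0),   # at the target, nearly at rest
        (min(e_small, v_large),  0.2),   # at the target but still moving fast
        (min(e_med,   v_small),  0.4),
        (e_large,               -0.5),   # far from the target
    ]
    num = sum(w * r for w, r in rules)
    den = sum(w for w, _ in rules) + 1e-9
    return num / den                     # weighted-average defuzzification

# Total reward is the sum over both joints, as described above.
def total_reward(d_theta1, theta_dot1, d_theta2, theta_dot2):
    return fuzzy_reward(d_theta1, theta_dot1) + fuzzy_reward(d_theta2, theta_dot2)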

[FIGURE 2 OMITTED]

[FIGURE 3 OMITTED]

3. EXPERIMENTAL RESULTS

The experiments confirm the success of our approaches. The following figures show the results of the performed simulations:

* State space with uneven and even (0.1 rad increment) distribution of the angle position (Fig. 4)

* Learning with and without the use of a previously learned policy (Fig. 5)

[FIGURE 4 OMITTED]

[FIGURE 5 OMITTED]

4. CONCLUSION

This paper introduced novel approaches for reducing the computational effort of reinforcement learning by means of an unevenly distributed state space and a policy library that accelerates the learning process for new tasks. The experimental results showed that the learning performance can be improved significantly with this new approach. At the same time, the quality of the calculated solutions is better than that of solutions obtained with conventional methods. For these reasons, we conclude that our approach is a suitable technique to enhance the learning ability of reinforcement learning for the given application. However, this method can only be used for point-to-point motion control and is still difficult to apply to online learning. The presented methods are currently being implemented on a robot arm with a higher number of DOF.

5. REFERENCES

Bellman, R. E. (1957). Dynamic Programming, ISBN: 978-0691079516, Princeton University Press, Princeton, NJ

Denzinger, J. & Laureyns, I. (2008). A study of reward functions in reinforcement learning on a dynamic model of a two-link planar robot, Proceedings of DAAAM 2008, ISBN: 978-3-901509-68-1, Vienna, Austria, Oct 2008

Martin H., J. A. & De Lope, J. (2007). A Distributed Reinforcement Learning Architecture for Multi-Link Robots, Proceedings of ICINCO 2007, Angers, France, May 2007

Michels, K. (2002). Fuzzy-Regelung: Grundlagen, Entwurf, Analyse, Springer, ISBN: 3-540-43548-4, Berlin, Heidelberg

Mitchell, T. M. (1997). Machine Learning, McGraw-Hill, ISBN: 0-07-042807-7, New York

Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction, MIT Press, ISBN: 978-0-262-19398-6, Cambridge