Behavior Cloning for Deep Reinforcement Learning Trajectory Generation

Motion planning for robotic arms is more computationally demanding than one might initially realize. When you move your arm, you do not need to actively think about how to move each joint to reach an end position while avoiding obstacles, because the human brain is very efficient at motion planning. Robotic arms handle motion planning differently: before the robot moves, the motion planner has already calculated all states between the start and end position. The computational difficulty in this approach comes from the infinite number of possible joint positions the arm could occupy between the start and the goal, so searching over this continuous space directly would be extremely inefficient. Consequently, motion planners simplify the problem by discretizing the possible arm positions, which limits the arm to configurations that fall within that discretization. This approach works well for free-space motion, when the arm does not need to plan around a cluttered scene, but it often struggles to compute trajectories in tightly constrained spaces, such as a narrow passageway. If the discretization is too coarse, a standard motion planner may not find a solution at all; if we make the discretization finer, computation time grows exponentially.
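To give a feel for that trade-off, here is a minimal sketch of how the number of discrete joint configurations explodes as the grid gets finer. The step counts are illustrative assumptions, not values from any particular planner.

```python
# Rough illustration of why finer discretization explodes the search space.
# Step counts are assumptions chosen for illustration only.

def num_states(dof: int, steps_per_joint: int) -> int:
    """Number of discrete joint configurations on a uniform grid."""
    return steps_per_joint ** dof

for steps in (8, 16, 32, 64):
    print(f"6-DOF arm, {steps} steps/joint -> {num_states(6, steps):,} states")

# Doubling the resolution per joint multiplies the state count by 2^6 = 64,
# going from roughly 262 thousand states at 8 steps/joint to nearly 69 billion
# at 64 steps/joint.
```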

To get around these shortcomings, it is helpful to use a different approach such as reinforcement learning. The reinforcement learning agent plans trajectories by generating intermediate states through the prediction of continuous joint increments. The joint trajectory is produced by repeatedly observing the environment, planning a small, continuous joint update, and then executing that update. The joint updates are planned by a deep neural network that learns, through trial and error, how to navigate the environment. The trajectories taken by the arm are judged according to a cost function, and the robot receives a reward in proportion to the optimality of the trajectory it followed. The neural network adjusts its predictions to obtain the best possible reward.
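The observe-predict-execute loop described above might look roughly like the following sketch. The `env` and `actor` interfaces and the per-step increment cap are illustrative assumptions, not the actual API of the planner.

```python
import numpy as np

# Minimal sketch of the observe -> predict increment -> execute loop.
# `env` and `actor` are assumed interfaces for illustration only.

MAX_STEP_RAD = 0.05  # assumed cap on each continuous joint increment

def rollout(env, actor, max_steps=200):
    state = env.reset()                          # joint positions + scene features
    total_reward = 0.0
    for _ in range(max_steps):
        delta_q = actor.predict(state)           # continuous joint increment
        delta_q = np.clip(delta_q, -MAX_STEP_RAD, MAX_STEP_RAD)
        state, reward, done = env.step(delta_q)  # execute the small update
        total_reward += reward                   # reward reflects trajectory cost
        if done:                                 # reached the waypoint or collided
            break
    return total_reward
```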

[Figure: hrlmp_simple_2.png]

With a reinforcement learning approach to trajectory generation for arms with six or more degrees of freedom, training times can be very long. We can therefore apply our existing search-based motion planning tools to improve training times for the reinforcement learning algorithm. When the robotic arm starts the learning process, it wiggles around randomly and is rewarded according to the quality of the trajectories that result. To find a valid trajectory, the arm has to wiggle in just the right way to reach the waypoint and collect a large reward. Because of this random exploration, the system takes a very long time to train, on the order of several days depending on hardware. However, we have access to a plethora of valid trajectories provided by planners such as TrajOpt and OMPL. The reinforcement learning agent imitates the actions taken by these motion planners, with some exploratory noise added, and reaches a valid path much sooner than it would by chance.
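One simple way to realize this, sketched below under assumed names, is to replay an expert joint trajectory from TrajOpt or OMPL, add a little Gaussian noise to each expert increment, and store the resulting transitions so the agent learns from near-expert behavior rather than random wiggling.

```python
import numpy as np

# Sketch of expert-guided exploration: follow the planner's joint increments
# plus small Gaussian noise, and record the transitions for learning.
# The `env` interface, replay buffer, and noise scale are assumptions.

def expert_guided_episode(env, expert_trajectory, replay_buffer, noise_std=0.01):
    state = env.reset()
    for q_curr, q_next in zip(expert_trajectory[:-1], expert_trajectory[1:]):
        expert_action = q_next - q_curr                  # expert joint increment
        noise = np.random.normal(0.0, noise_std, expert_action.shape)
        action = expert_action + noise                   # exploratory perturbation
        next_state, reward, done = env.step(action)
        replay_buffer.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            break
```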

The examples of valid trajectories provided by the motion planners are used to train the actor network of the reinforcement learning system in a supervised fashion. With the actor network learning from trajectories generated by these conventional planners, their weaknesses come into play: TrajOpt generally takes a very long time to generate valid trajectories, and OMPL algorithms such as RRT are non-deterministic. Because of these weaknesses, we cannot rely on the motion planners' examples to train the actor network entirely in a supervised fashion. Instead, we first train the actor network on the examples provided by the motion planners and then switch to training it through standard reinforcement learning exploration. Training benefits from the expert motion plans by imitating the planners' actions first, and then learning how to improve on the imitated trajectories to earn more reward.
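The supervised phase is essentially a regression problem: the actor's predicted joint increments are fit to the expert's increments before reinforcement learning takes over. The sketch below shows that behavior-cloning step; the network sizes, input dimension, and data loader are assumptions for illustration, not the actual architecture.

```python
import torch
import torch.nn as nn

# Behavior-cloning sketch: regress the actor's joint increments onto the
# expert planner's increments before RL fine-tuning. Layer sizes, the 24-dim
# observation, and the expert data loader are illustrative assumptions.

actor = nn.Sequential(nn.Linear(24, 256), nn.ReLU(),
                      nn.Linear(256, 256), nn.ReLU(),
                      nn.Linear(256, 6), nn.Tanh())       # 6 normalized joint increments
optimizer = torch.optim.Adam(actor.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def pretrain_on_expert(expert_loader, epochs=50):
    for _ in range(epochs):
        for states, expert_actions in expert_loader:      # batches from TrajOpt/OMPL plans
            pred_actions = actor(states)
            loss = loss_fn(pred_actions, expert_actions)  # imitate the expert increments
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# After pretraining, the same actor is handed off to the RL loop for fine-tuning.
```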

This early work shows promise for leveraging reinforcement learning for motion planning. A number of improvements are slated for both the near term and the long term, in particular to enable use in meaningful industrial confined-space applications. One such improvement is a general restructuring of the core components of the motion planner: the reward function and visualization logic need to be decoupled from the environment simulator, and some of the hyperparameters should be renamed so that they are conceptually clearer and more concise.
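As a rough illustration of that decoupling, the reward could be injected into the simulator as a standalone object so it can be swapped or tuned without touching environment code. The class, field names, and weights below are hypothetical, intended only to show the shape of the refactor.

```python
# Hypothetical sketch of a reward function decoupled from the environment
# simulator. Names, weights, and state fields are assumptions for illustration.

class ClearanceReward:
    def __init__(self, goal_weight=1.0, collision_penalty=10.0):
        self.goal_weight = goal_weight
        self.collision_penalty = collision_penalty

    def __call__(self, state, action, next_state):
        reward = -self.goal_weight * next_state["distance_to_goal"]
        if next_state["in_collision"]:
            reward -= self.collision_penalty
        return reward

# env = ArmEnvironment(reward_fn=ClearanceReward())  # simulator stays reward-agnostic
```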

Furthermore, training time, which currently takes approximately 3-5 days of compute, can be dramatically decreased by dividing training into multiple tasks that run in parallel on a specialized deep learning computer with multiple GPUs. By leveraging four GPUs, we estimate that the training process could be cut down to roughly 24 hours.
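One plausible way to split the work, sketched below with a stubbed-out rollout routine, is to collect experience in parallel worker processes (for example, one per GPU) and aggregate it for a single learner. The worker count, episode count, and helper names are assumptions.

```python
from multiprocessing import Pool
import random

# Sketch of parallel experience collection: each worker process (e.g. one per
# GPU) gathers rollouts independently, and the results are merged for a single
# learner. The rollout body is a stub; counts and names are assumptions.

def collect_rollouts(seed, n_episodes=25):
    random.seed(seed)
    # Placeholder for running `n_episodes` of the RL environment on this worker.
    return [("state", "action", random.random()) for _ in range(n_episodes)]

if __name__ == "__main__":
    with Pool(processes=4) as pool:              # e.g. one worker per GPU
        batches = pool.map(collect_rollouts, range(4))
    experience = [transition for batch in batches for transition in batch]
    print(f"Collected {len(experience)} transitions from 4 workers")
```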

Farther out, we hope to leverage a different neural network architecture that can handle the arbitrary meshes and point clouds that dynamic systems will inevitably encounter. These are elements of the scene the system needs to learn from explicitly, and the current neural network cannot handle them.
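One candidate direction for such an architecture, shown here only as an illustrative assumption rather than the planned design, is a permutation-invariant encoder (in the spirit of PointNet) that maps a variable-size point cloud to a fixed-length feature vector the policy network can consume.

```python
import torch
import torch.nn as nn

# Illustrative assumption: a permutation-invariant point cloud encoder that
# produces a fixed-length scene feature regardless of how many points are
# observed. This is not the current network, only a sketch of one option.

class PointCloudEncoder(nn.Module):
    def __init__(self, feature_dim=128):
        super().__init__()
        self.per_point = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, feature_dim), nn.ReLU())

    def forward(self, points):                 # points: (batch, n_points, 3)
        features = self.per_point(points)      # per-point features
        return features.max(dim=1).values      # symmetric pooling over points

# encoded = PointCloudEncoder()(torch.rand(1, 2048, 3))  # works for any point count
```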

There remains a large amount of opportunity in this space, and a hybrid optimization-based and learning-based motion planning framework offers a balanced solution that promises to enable precision motion planning applications without driving excessive planning times or producing invalid solutions.

Stay tuned as we look forward to sharing more on this and other motion planning topics, as we seek to further the state of the art and provide compelling capabilities in the open source motion planning framework that are enabling advanced applications every day.