Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

1Syracuse University, 2Midea Group, 3Shanghai University, 4East China Normal University, 5Beijing Innovation Center of Humanoid Robotics
*Equal contribution

Abstract

Learning visuomotor policies for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of the action space: typically, a goal can be accomplished in multiple ways, resulting in a multimodal action distribution for a single task. The complexity of the action distribution escalates as the number of tasks increases. In this work, we propose Discrete Policy, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy employs vector quantization to map action sequences into a discrete latent space, facilitating the learning of task-specific codes. These codes are then reconstructed into the action space conditioned on observations and language instructions. We evaluate our method in simulation and on multiple real-world embodiments, covering both single-arm and bimanual robot settings. We demonstrate that our proposed Discrete Policy outperforms the well-established Diffusion Policy baseline and many state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is 26% higher than Diffusion Policy and 15% higher than OpenVLA. As the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5%, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents.

Overview of Discrete Policy

Discrete Policy: In the first training stage, as indicated by the green arrow, we train a VQ-VAE that maps actions into a discrete latent space with an encoder and then reconstructs the actions from the latent embeddings with a decoder. In the second training stage, as indicated by the brown arrow, we train a latent diffusion model that predicts task-specific latent embeddings to guide the decoder in predicting accurate actions.
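To make the two-stage recipe concrete, below is a minimal PyTorch sketch of a stage-1 action VQ-VAE and a stage-2 conditional latent denoiser. Module names, layer sizes, the codebook size, the conditioning scheme, and the corruption/denoising objective are illustrative assumptions, not the paper's exact architecture or diffusion formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Maps encoder outputs to their nearest codebook entries (straight-through estimator)."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z_e):                                 # z_e: (B, code_dim)
        d = torch.cdist(z_e, self.codebook.weight)          # distances to all codes
        idx = d.argmin(dim=-1)                               # nearest code index per sample
        z_q = self.codebook(idx)
        # Codebook + commitment losses; straight-through gradient back to the encoder.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, vq_loss

class ActionVQVAE(nn.Module):
    """Stage 1: encode an action chunk into a discrete latent, then reconstruct it
    conditioned on observation/language features."""
    def __init__(self, act_dim=7, chunk=16, cond_dim=512, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(act_dim * chunk, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.quantizer = VectorQuantizer(code_dim=code_dim)
        self.decoder = nn.Sequential(nn.Linear(code_dim + cond_dim, 256), nn.ReLU(),
                                     nn.Linear(256, act_dim * chunk))

    def forward(self, actions, cond):                        # actions: (B, chunk, act_dim)
        B = actions.shape[0]
        z_e = self.encoder(actions.reshape(B, -1))
        z_q, idx, vq_loss = self.quantizer(z_e)
        recon = self.decoder(torch.cat([z_q, cond], dim=-1)).reshape_as(actions)
        loss = F.mse_loss(recon, actions) + vq_loss
        return recon, z_q, loss

class LatentPrior(nn.Module):
    """Stage-2 denoiser: predicts the clean latent from a noisy latent, a noise level,
    and observation/language conditioning (hypothetical architecture)."""
    def __init__(self, code_dim=64, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(code_dim + 1 + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, code_dim))

    def forward(self, z_noisy, t, cond):
        return self.net(torch.cat([z_noisy, t, cond], dim=-1))

def stage2_loss(latent_prior, vqvae, actions, cond):
    """Stage 2 (sketch): freeze the VQ-VAE and train the conditional denoiser to
    recover the target latent from a corrupted one."""
    with torch.no_grad():
        _, z_q, _ = vqvae(actions, cond)                     # target latent from frozen VQ-VAE
    t = torch.rand(z_q.shape[0], 1, device=z_q.device)       # random noise level in [0, 1)
    noise = torch.randn_like(z_q)
    z_noisy = (1 - t) * z_q + t * noise                       # simple interpolation-style corruption
    return F.mse_loss(latent_prior(z_noisy, t, cond), z_q)
```

At inference time, per the overview above, the stage-2 model predicts a task-specific latent embedding from the observation and language features, and the frozen decoder reconstructs the corresponding action chunk from it.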


Experiments

We ask the following key questions about the effectiveness of our algorithm: 1) Can Discrete Policy be effectively deployed in real-world scenarios? 2) Can Discrete Policy scale up to multiple complex tasks? 3) Can Discrete Policy effectively distinguish different behavioral modalities across multiple tasks? To answer these questions, we built real-world robotic arm environments, designed a variety of manipulation tasks with rich skill requirements, including long-horizon tasks, and conducted extensive experiments.

Task Description for Real-World Experiments

For the single-arm Franka experiments, we designed two multi-task evaluation protocols named Multi-Task 5 (MT-5) and Multi-Task 12 (MT-12).
MT-5 contains 5 tasks: 1) PlaceTennis. 2) OpenDrawer. 3) OpenBox. 4) StackBlock. 5) UprightMug.
MT-12 extends MT-5's task range to 12 tasks, including 3 long-horizon tasks, more varied scenarios, and more complex skill requirements such as flip, press, and pull.
The 12 tasks are: 1) PlaceTennis. 2) OpenDrawer. 3) OpenBox. 4) StackBlock. 5) UprightMug. 6) CloseDrawer. 7) PlaceCan. 8) ArrangeFlower. 9) CloseMicrowave. 10) InsertToast. 11) StoreToyCar. 12) DisposePaper.

For the bimanual UR5 robotic experiments, we designed 6 challenging tasks that require collaboration between two robotic arms.
The 6 tasks are: 1) TennisBallPack. 2) BreadTransfer. 3) StackBlock. 4) BreadDrawer. 5) SweepPaper-1. 6) SweepPaper-2.

Quantitative Comparison

Results on Single-arm Franka Robot.

The figures on the left and right show the success rates on MT-5 and MT-12, respectively.

Comparisons on Multi-Task 5 (MT-5) with the single-arm Franka robot; success rates (%) are reported. The symbol * denotes methods pretrained on 970K OpenX robot data.
Method             PlaceTennis   OpenDrawer   OpenBox   StackBlock   UprightMug   Average
RT-1                    25            30          30         10           15         22
BeT                     45            30          65         30           15         37
BESO                    40            30          55         25           15         33
MDT                     50            35          55         20           25         37
Octo*                   40            55          50         30           40         43
OpenVLA*                85            80          90         40           50         69
MT-ACT                  80            80         100         55           45         59
Diffusion Policy        60            55          80         50           45         58
Discrete Policy         85            90         100         75           70         84

Results on Bimanual UR5 Robot.

Comparing Discrete Policy with baseline methods on six bimanual UR5 robot tasks.
Method             TennisBallPack   BreadTransfer   StackBlock   BreadDrawer   SweepPaper-1   SweepPaper-2   Average
BeT                      30               10              0            30             40             20        21.7
MT-ACT                   70               40             10            70             80             60        55.0
Diffusion Policy         30               35              0            45             65             50        37.5
Discrete Policy          70               55             30            85             85             75        65.8

Visualization of Discrete Policy

The t-SNE visualization of feature embeddings from Discrete Policy reveals that skills across different tasks cluster closely together. This pattern suggests that discrete latent spaces are capable of disentangling the complex, multimodal action distributions encountered in multi-task policy learning.
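A visualization of this kind can be produced with scikit-learn's t-SNE. The sketch below is a hedged example, assuming `latents` is an (N, D) array of latent embeddings collected during evaluation and `task_ids` is an (N,) array of integer task labels; both names and the plotting choices are illustrative, not the paper's exact procedure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_tsne(latents: np.ndarray, task_ids: np.ndarray, out_path: str = "tsne.png"):
    # Project the high-dimensional latent embeddings down to 2-D.
    xy = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(latents)
    # Color each point by its task so per-task clusters are visible.
    plt.figure(figsize=(6, 5))
    scatter = plt.scatter(xy[:, 0], xy[:, 1], c=task_ids, cmap="tab20", s=5)
    plt.legend(*scatter.legend_elements(), title="Task", loc="best", fontsize=7)
    plt.title("t-SNE of Discrete Policy latent embeddings")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
```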


Qualitative Comparison

We qualitatively compare Discrete Policy with Diffusion Policy and show cases where Diffusion Policy fails.

Discrete Policy (Ours)

Diffusion Policy

BibTeX

@article{wu2024discrete,
  title={Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation},
  author={Wu, Kun and Zhu, Yichen and Li, Jinming and Wen, Junjie and Liu, Ning and Xu, Zhiyuan and Qiu, Qinru and Tang, Jian},
  journal={arXiv preprint arXiv:2409.18707},
  year={2024}
}