Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation

1Syracuse University, 2Midea Group, 3Shanghai University, 4East China Normal University, 5Beijing Innovation Center of Humanoid Robotics
*Equal contribution

Abstract

Learning visuomotor policies for multi-task robotic manipulation has been a long-standing challenge for the robotics community. The difficulty lies in the diversity of the action space: typically, a goal can be accomplished in multiple ways, resulting in a multimodal action distribution for a single task. The complexity of the action distribution escalates as the number of tasks increases. In this work, we propose Discrete Policy, a robot learning method for training universal agents capable of multi-task manipulation skills. Discrete Policy employs vector quantization to map action sequences into a discrete latent space, facilitating the learning of task-specific codes. These codes are then reconstructed into the action space conditioned on observations and language instructions. We evaluate our method in simulation and on multiple real-world embodiments, covering both single-arm and bimanual robot settings. We demonstrate that our proposed Discrete Policy outperforms the well-established Diffusion Policy baseline and many state-of-the-art approaches, including ACT, Octo, and OpenVLA. For example, in a real-world multi-task training setting with five tasks, Discrete Policy achieves an average success rate that is 26% higher than Diffusion Policy and 15% higher than OpenVLA. As the number of tasks increases to 12, the performance gap between Discrete Policy and Diffusion Policy widens to 32.5%, further showcasing the advantages of our approach. Our work empirically demonstrates that learning multi-task policies within the latent space is a vital step toward achieving general-purpose agents.

Overview of Discrete Policy

Discrete Policy: In the first training stage, as indicated by the green arrow, we train a VQ-VAE that maps actions into a discrete latent space with an encoder and then reconstructs the actions from the latent embeddings with a decoder. In the second training stage, as indicated by the brown arrow, we train a latent diffusion model that predicts task-specific latent embeddings to guide the decoder in predicting accurate actions.
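To make the two-stage recipe concrete, below is a minimal PyTorch sketch of a stage-1 action VQ-VAE and a stage-2 conditional latent denoiser. Module names, layer sizes, the codebook size, the conditioning scheme, and the corruption/denoising objective are illustrative assumptions, not the paper's exact architecture or diffusion formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Maps encoder outputs to their nearest codebook entries (straight-through estimator)."""
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z_e):                                 # z_e: (B, code_dim)
        d = torch.cdist(z_e, self.codebook.weight)          # distances to all codes
        idx = d.argmin(dim=-1)                               # nearest code index per sample
        z_q = self.codebook(idx)
        # Codebook + commitment losses; straight-through gradient back to the encoder.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()
        return z_q, idx, vq_loss

class ActionVQVAE(nn.Module):
    """Stage 1: encode an action chunk into a discrete latent, then reconstruct it
    conditioned on observation/language features."""
    def __init__(self, act_dim=7, chunk=16, cond_dim=512, code_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(act_dim * chunk, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.quantizer = VectorQuantizer(code_dim=code_dim)
        self.decoder = nn.Sequential(nn.Linear(code_dim + cond_dim, 256), nn.ReLU(),
                                     nn.Linear(256, act_dim * chunk))

    def forward(self, actions, cond):                        # actions: (B, chunk, act_dim)
        B = actions.shape[0]
        z_e = self.encoder(actions.reshape(B, -1))
        z_q, idx, vq_loss = self.quantizer(z_e)
        recon = self.decoder(torch.cat([z_q, cond], dim=-1)).reshape_as(actions)
        loss = F.mse_loss(recon, actions) + vq_loss
        return recon, z_q, loss

class LatentPrior(nn.Module):
    """Stage-2 denoiser: predicts the clean latent from a noisy latent, a noise level,
    and observation/language conditioning (hypothetical architecture)."""
    def __init__(self, code_dim=64, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(code_dim + 1 + cond_dim, 256), nn.ReLU(),
                                 nn.Linear(256, code_dim))

    def forward(self, z_noisy, t, cond):
        return self.net(torch.cat([z_noisy, t, cond], dim=-1))

def stage2_loss(latent_prior, vqvae, actions, cond):
    """Stage 2 (sketch): freeze the VQ-VAE and train the conditional denoiser to
    recover the target latent from a corrupted one."""
    with torch.no_grad():
        _, z_q, _ = vqvae(actions, cond)                     # target latent from frozen VQ-VAE
    t = torch.rand(z_q.shape[0], 1, device=z_q.device)       # random noise level in [0, 1)
    noise = torch.randn_like(z_q)
    z_noisy = (1 - t) * z_q + t * noise                       # simple interpolation-style corruption
    return F.mse_loss(latent_prior(z_noisy, t, cond), z_q)
```

At inference time, per the overview above, the stage-2 model predicts a task-specific latent embedding from the observation and language features, and the frozen decoder reconstructs the corresponding action chunk from it.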


Experiments

We ask the following key questions about the effectiveness of our algorithm: 1) Can Discrete Policy be effectively deployed in real-world scenarios? 2) Can Discrete Policy scale up to multiple complex tasks? 3) Can Discrete Policy effectively distinguish different behavioral modalities across multiple tasks? To answer these questions, we built real-world robotic arm environments, designed a variety of manipulation tasks with rich skill requirements, including long-horizon tasks, and conducted extensive experiments.

Task Description for Real-World Experiments

For the single-arm Franka experiments, we designed two multi-task evaluation protocols named Multi-Task 5 (MT-5) and Multi-Task 12 (MT-12).
MT-5 contains 5 tasks: 1) PlaceTennis. 2) OpenDrawer. 3) OpenBox. 4) StackBlock. 5) UprightMug.
MT-12 extends MT-5's task range to 12 tasks, including 3 long-horizon tasks, more varied scenarios, and more complex skill requirements such as flip, press, and pull.
The 12 tasks are: 1) PlaceTennis. 2) OpenDrawer. 3) OpenBox. 4) StackBlock. 5) UprightMug. 6) CloseDrawer. 7) PlaceCan. 8) ArrangeFlower. 9) CloseMicrowave. 10) InsertToast. 11) StoreToyCar. 12) DisposePaper.

For the bimanual UR5 robotic experiments, we designed 6 challenging tasks that require collaboration between two robotic arms.
The 6 tasks are: 1) TennisBallPack. 2) BreadTransfer. 3) StackBlock. 4) BreadDrawer. 5) SweepPaper-1. 6) SweepPaper-2.

Quantitative Comparison

Results on Single-arm Franka Robot.

The figures on the left and right show the success rates on MT-5 and MT-12, respectively.

Comparisons on Multi-Task 5 (MT-5) with the single-arm Franka robot; success rates (%) are reported. The symbol * denotes methods pretrained on 970K OpenX robot data.
Method             PlaceTennis   OpenDrawer   OpenBox   StackBlock   UprightMug   Average
RT-1                    25            30          30         10           15         22
BeT                     45            30          65         30           15         37
BESO                    40            30          55         25           15         33
MDT                     50            35          55         20           25         37
Octo*                   40            55          50         30           40         43
OpenVLA*                85            80          90         40           50         69
MT-ACT                  80            80         100         55           45         59
Diffusion Policy        60            55          80         50           45         58
Discrete Policy         85            90         100         75           70         84

Results on Bimanual UR5 Robot.

Comparing Discrete Policy with baseline methods on six bimanual UR5 robot tasks.
Method             TennisBallPack   BreadTransfer   StackBlock   BreadDrawer   SweepPaper-1   SweepPaper-2   Average
BeT                      30               10              0            30             40             20        21.7
MT-ACT                   70               40             10            70             80             60        55.0
Diffusion Policy         30               35              0            45             65             50        37.5
Discrete Policy          70               55             30            85             85             75        65.8

Visualization of Discrete Policy

The t-SNE visualization of feature embeddings from Discrete Policy reveals that skills across different tasks cluster closely together. This pattern suggests that discrete latent spaces are capable of disentangling the complex, multimodal action distributions encountered in multi-task policy learning.
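A visualization of this kind can be produced with scikit-learn's t-SNE. The sketch below is a hedged example, assuming `latents` is an (N, D) array of latent embeddings collected during evaluation and `task_ids` is an (N,) array of integer task labels; both names and the plotting choices are illustrative, not the paper's exact procedure.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_latent_tsne(latents: np.ndarray, task_ids: np.ndarray, out_path: str = "tsne.png"):
    # Project the high-dimensional latent embeddings down to 2-D.
    xy = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(latents)
    # Color each point by its task so per-task clusters are visible.
    plt.figure(figsize=(6, 5))
    scatter = plt.scatter(xy[:, 0], xy[:, 1], c=task_ids, cmap="tab20", s=5)
    plt.legend(*scatter.legend_elements(), title="Task", loc="best", fontsize=7)
    plt.title("t-SNE of Discrete Policy latent embeddings")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
```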


Qualitative Comparison

We qualitatively compare Discrete Policy with Diffusion Policy and show cases where Diffusion Policy fails.

Discrete Policy (Ours)

Diffusion Policy

BibTeX

@article{wu2024discrete,
  title={Discrete Policy: Learning Disentangled Action Space for Multi-Task Robotic Manipulation},
  author={Wu, Kun and Zhu, Yichen and Li, Jinming and Wen, Junjie and Liu, Ning and Xu, Zhiyuan and Qiu, Qinru and Tang, Jian},
  journal={arXiv preprint arXiv:2409.18707},
  year={2024}
}