You Only Teach Once: Learn One-Shot Bimanual Robotic Manipulation from Video Demonstrations arXiv 2025.01 (under review)
- Huayi Zhou The Chinese University of Hong Kong, Shenzhen
- Ruixiang Wang Harbin Institute of Technology, Weihai
- Yunxin Tai DexForce, Shenzhen
- Yueci Deng DexForce, Shenzhen
- Guiliang Liu The Chinese University of Hong Kong, Shenzhen
- Kui Jia The Chinese University of Hong Kong, Shenzhen
Abstract
Bimanual robotic manipulation is a long-standing challenge in embodied intelligence due to its demands for dual-arm spatial-temporal coordination and its high-dimensional action space. Previous studies rely on pre-defined action taxonomies or direct teleoperation to alleviate or circumvent these issues, which often limits their simplicity, versatility, and scalability. In contrast, we believe that the most effective and efficient way to teach bimanual manipulation is to learn from human-demonstrated videos, where rich features such as spatial-temporal positions, dynamic postures, interaction states, and dexterous transitions are available almost for free. In this work, we propose YOTO (You Only Teach Once), which can extract and then inject patterns of bimanual actions from as few as a single binocular observation of hand movements, teaching dual robot arms various complex tasks. Furthermore, based on keyframe-based motion trajectories, we devise a subtle solution for rapidly generating training demonstrations with diverse variations of manipulated objects and their locations. These data can then be used to learn a customized bimanual diffusion policy (BiDP) across diverse scenes. In experiments, YOTO achieves impressive performance in mimicking 5 intricate long-horizon bimanual tasks, exhibits strong generalization under different visual and spatial conditions, and outperforms existing visuomotor imitation learning methods in accuracy and efficiency.
▶ Details and Framework of Our YOTO
Our proposed YOTO (You Only Teach Once) facilitates various complex long-horizon bimanual tasks. It requires only a one-shot observation from a single third-person binocular camera to extract the fine-grained motion trajectories of human hands, which are then used for dual-arm coordinated action injection and for rapidly proliferating training demonstrations.
Overview of our proposed YOTO. It is a general framework consisting of three main modules: (a) human hand motion extraction and injection, (b) training demonstration proliferation from one-shot teaching, and (c) training and deployment of a customized bimanual diffusion policy (BiDP). It is best to zoom in to view the details.
▶ Hand Motion Extraction and Injection
We focus on understanding human hands, including their location, handedness (left/right), 3D shape, joints, pose, contact, and open/closed state. These features can be perceived using hand-related vision methods. After extracting hand motion trajectories, we do not simply inject step-wise actions into the robots; instead, we simplify the continuous trajectory into discrete keyframes and assign the corresponding keyposes to the two arms, which execute them via inverse kinematics interpolation. We also record and replay the order of dual-hand movements (termed the motion mask), which helps address the dual-arm coordination issue in long-horizon bimanual tasks. In this way, we obtain a stable and refined manipulation motion exemplar.
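As a rough illustration of this keyframe simplification and motion-mask bookkeeping (a minimal sketch, not the paper's released code; the function names, the velocity threshold, and the near-simultaneity window are assumptions), the snippet below selects keyframes where a hand's open/closed state flips or where it resumes moving after a pause, then merges both hands' keyframes into a time-ordered schedule:

```python
import numpy as np

def extract_keyframes(positions, gripper_open, vel_eps=0.005):
    """Return keyframe indices of one hand's dense trajectory.

    positions:    (T, 3) array of hand positions, one row per frame.
    gripper_open: (T,) boolean open/closed hand state per frame.
    A frame becomes a keyframe when the open/closed state flips or
    when the hand resumes moving after being nearly stationary
    (pauses usually mark sub-goals such as grasp or release).
    """
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1)  # (T-1,)
    keyframes = [0]
    for t in range(1, len(positions) - 1):
        state_flip = gripper_open[t] != gripper_open[t - 1]
        pause_end = speed[t - 1] < vel_eps and speed[t] >= vel_eps
        if state_flip or pause_end:
            keyframes.append(t)
    keyframes.append(len(positions) - 1)
    return keyframes


def build_motion_mask(left_keyframes, right_keyframes, window=2):
    """Merge both hands' keyframes into a time-ordered schedule.

    Returns a list of (frame_index, 'left' | 'right' | 'both') entries
    recording which arm(s) act at each step, so the robot can replay
    the original ordering of the dual-hand movements.
    """
    events = sorted([(t, "left") for t in left_keyframes] +
                    [(t, "right") for t in right_keyframes])
    schedule = []
    for t, hand in events:
        if schedule and t - schedule[-1][0] <= window and schedule[-1][1] != hand:
            schedule[-1] = (schedule[-1][0], "both")   # near-simultaneous
        else:
            schedule.append((t, hand))
    return schedule
```

The selected keyposes can then be interpolated and tracked by each arm's inverse kinematics solver in the order given by the schedule.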
Below are illustrations of five major bimanual tasks: pull drawer, pour water, unscrew bottle, uncover lid, and open box.
(Video demonstrations: Pull Drawer, Pour Water, Unscrew Bottle, Uncover Lid, Open Box)
Two additional bimanual tasks, reorient pen and flip basket, are also demonstrated.
(Video demonstrations: Reorient Pen, Flip Basket)
▶ Auto-Rollout Verification in the Real World
Based on one-shot teaching, we propose two demonstration proliferation schemes: automatic rollout verification on real robots and point cloud-level geometry augmentation of the manipulated objects. This solution is an efficient and reliable route to quickly producing training data for imitation learning. Below is an example showing how the automatic rollout verification is conducted.
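As a hedged sketch of the geometry-augmentation half of this proliferation scheme (the paper's exact procedure and parameter ranges may differ; all names and thresholds here are illustrative), the snippet below applies a random tabletop rigid transform consistently to the manipulated object's point cloud and to the keyframe gripper poses, so every transform that passes the real-robot rollout check yields a new, geometrically consistent demonstration:

```python
import numpy as np

def random_planar_se3(max_trans=0.05, max_yaw=np.deg2rad(15)):
    """Sample a random tabletop (planar) rigid transform: a small yaw
    rotation about the vertical axis plus an xy translation."""
    yaw = np.random.uniform(-max_yaw, max_yaw)
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:2, 3] = np.random.uniform(-max_trans, max_trans, size=2)
    return T


def augment_demo(object_points, keyposes, T):
    """Apply one transform consistently to the object point cloud
    ((N, 3) array) and to the gripper keyframe poses (list of 4x4
    matrices), all expressed in the same world frame."""
    pts_h = np.concatenate([object_points, np.ones((len(object_points), 1))], axis=1)
    new_points = (T @ pts_h.T).T[:, :3]
    new_keyposes = [T @ P for P in keyposes]
    return new_points, new_keyposes


# Auto-rollout verification, sketched in pseudocode: execute the transformed
# keyframe trajectory on the real robot and keep only demos whose rollout
# succeeds (execute_on_robot is a hypothetical success check).
#
# for _ in range(num_demos):
#     T = random_planar_se3()
#     pts, poses = augment_demo(object_points, demo_keyposes, T)
#     if execute_on_robot(poses):
#         dataset.append((pts, poses))
```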
▶ Qualitative Evaluation Results of BiDP
We evaluate YOTO on five real-world bimanual tasks: pull drawer, pour water, unscrew bottle, uncover lid, and open box. These tasks collectively encompass two types of dual-arm collaboration: strictly asynchronous and synchronous. The manipulated objects may be rigid, articulated, deformable, or non-prehensile. The tasks also involve many primitive skills such as pull/push, pick/place, re-orient, unscrew, revolve, and lift up, some of which strictly require both arms to complete. More importantly, all tasks are long-horizon and therefore quite complex, as each contains multiple substeps.
Below, we show qualitative rollout samples from a third-person perspective for all evaluation tasks mentioned in the main paper. They illustrate our model's generalization to variations in object category and location. These examples show more complete scenes and the motion of both robot arms, and can be considered a supplement to the limited field of view of the binocular observation camera. Note that these third-person video recordings do not participate in any training or testing.
▶ Failure Cases
Although our method BiDP outperforms many strong baselines on long-horizon bimanual manipulation tasks, it still exhibits various failure cases during evaluation. Below, we analyze failures of BiDP in real-world experiments and show representative failure examples from all real-robot executions performed with our method.
(Failure-case videos: Pull Drawer, Pour Water, Unscrew Bottle, Uncover Lid, Open Box)
Citation
Acknowledgements
We acknowledge the providers of the various hardware used in this project, including the Aubo-i5 robotic arm, the DH PGI-80-80 gripper, and the kingfisher binocular camera.
The website template was borrowed from Jon Barron and Zip-NeRF.