YOTO++: Learning Long-Horizon Closed-Loop Bimanual Manipulation from One-Shot Human Video Demonstrations
(an extended journal version of our conference paper YOTO in RSS'2025)
- Huayi Zhou The Chinese University of Hong Kong, Shenzhen
- Ruixiang Wang Harbin Institute of Technology, Weihai
- Yunxin Tai DexForce, Shenzhen
- Yueci Deng DexForce, Shenzhen
- Guiliang Liu The Chinese University of Hong Kong, Shenzhen
- Kui Jia The Chinese University of Hong Kong, Shenzhen
Specifically, in this extension, YOTO++, we add several new components to enhance its functionality and comprehensiveness. ① First, we incorporate a vision-based alignment module during the initial grasping phase of each task, enabling closed-loop control that robustly handles dynamic disturbances. ② Second, we extend YOTO to two additional long-horizon tasks involving tool use, demonstrating its ability to consistently extract and execute temporally coherent multi-stage actions. ③ Third, we introduce three new bimanual manipulation tasks that encompass a broader range of primitive skills, further validating the generalizability of the proposed framework. ④ Lastly, we validate the cross-embodiment adaptability of YOTO by deploying it on a humanoid dual-arm robot, showcasing its platform-agnostic nature and real-world applicability across diverse robotic morphologies.
Below are step-wise rollout examples of ten bimanual manipulation tasks.
Abstract
Bimanual robotic manipulation remains a fundamental challenge due to the inherent complexity of dual-arm coordination and high-dimensional action spaces. This paper presents the extended YOTO++ (You Only Teach Once), a unified one-shot learning framework for teaching bimanual skills directly from third-person human video demonstrations. Our method extracts structured 3D hand motions using binocular vision and distills them into compact, keyframe-based trajectories for dual-arm execution. We develop a scalable demonstration proliferation strategy that synthetically augments one-shot demonstrations into diverse training samples, enabling effective learning of a customized bimanual diffusion policy. Extensive evaluations across a broad spectrum of long-horizon bimanual tasks, including asynchronous, synchronous, contact-rich, and non-prehensile scenarios, demonstrate strong generalization to novel skills and objects. We further introduce a visual alignment mechanism at the initial manipulation stage for closed-loop control, enabling the system to adapt dynamically to perturbations during execution. We validate the framework on a new dual-arm robotic platform to show seamless cross-embodiment transfer without additional retraining. YOTO++ achieves impressive performance in accuracy, robustness, and scalability, advancing the practical deployment of general-purpose bimanual manipulation systems.
▶ Details of Our Extended YOTO++
Our proposed YOTO++ (You Only Teach Once) enables cross-embodiment deployment (from a contralateral dual-arm setup to a humanoid one) and supports diverse bimanual tasks, including asynchronous, synchronous, and tool-using scenarios, with closed-loop control under dynamic disturbances during pre-grasping. Notably, it needs only a one-shot observation from a third-person binocular camera to extract the fine-grained motion trajectories of human hands, which can then be used for dual-arm coordinated action injection and rapid proliferation of training demonstrations.
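To make the keyframe-based action representation concrete, here is a minimal sketch (not the official YOTO++ code) of how a dense 3D hand trajectory could be compressed into keyframes for dual-arm injection; the input format, thresholds, and velocity heuristic are our own illustrative assumptions.

```python
# Illustrative sketch only: keep frames where the hand nearly stops or the
# grasp state toggles, which are natural candidates for trajectory keyframes.
import numpy as np

def select_keyframes(positions, grasp_states, vel_thresh=0.01, dt=1.0 / 30.0):
    """positions: (T, 3) hand positions in meters; grasp_states: (T,) bool (open/closed)."""
    velocities = np.linalg.norm(np.diff(positions, axis=0), axis=1) / dt  # (T-1,)
    keyframes = [0]  # always keep the first frame
    for t in range(1, len(positions) - 1):
        slow = velocities[t - 1] < vel_thresh             # hand nearly stationary
        toggled = grasp_states[t] != grasp_states[t - 1]  # gripper opens or closes
        if slow or toggled:
            keyframes.append(t)
    keyframes.append(len(positions) - 1)  # always keep the last frame
    return sorted(set(keyframes))

# Example: a synthetic 2-second trajectory at 30 Hz with one grasp event at t = 30.
T = 60
positions = np.cumsum(np.random.uniform(-0.002, 0.002, size=(T, 3)), axis=0)
grasp_states = np.array([t >= 30 for t in range(T)])
print(select_keyframes(positions, grasp_states))
```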
For the related object assets, we collected a variety of manipulated objects at the instance level for each of the ten bimanual tasks to improve and verify the generalizability of the trained policies. All of these objects come from everyday life and are not intentionally customized.
Below are visualizations of the ten bimanual tasks performed on real robots. We use different colors, namely cyan, yellow, and magenta, to distinguish frames of the left arm, the right arm, and both arms, respectively. Arrows are added manually to show movement trends. It is best to zoom in to view the details.
▶ ① Visual Alignment Enabling Closed-Loop Pre-Grasping
We observe that once the target object has been securely grasped, the relative pose between the end-effector and the object becomes fixed, reducing the necessity for high-frequency visual feedback. In this case, it is both safe and efficient to rely on either the initial demonstration-aligned keyframes or model-inferred trajectories for the subsequent execution. During the critical pre-grasping stage, however, disturbances to the object can significantly impact manipulation success. To address this, we propose a lightweight visual alignment algorithm that enables closed-loop pre-grasping by aligning the current object pose with the initially demonstrated configuration.
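For intuition, below is a minimal sketch of the alignment idea under our own assumptions (the actual module may differ): once the current object pose is estimated, e.g., by registering the live observation against the demonstration-time one, the demonstrated pre-grasp pose is re-targeted by the same rigid transform so that the gripper chases the displaced object.

```python
# Illustrative sketch only: re-target a demonstrated pre-grasp pose after the
# object has been moved, given 4x4 homogeneous poses in the robot base frame.
import numpy as np

def retarget_pregrasp(T_obj_demo, T_obj_now, T_grasp_demo):
    """Return the corrected pre-grasp pose for the current object placement."""
    # Rigid transform that carries the object from its demo pose to its current pose.
    T_delta = T_obj_now @ np.linalg.inv(T_obj_demo)
    # Apply the same motion to the demonstrated pre-grasp pose.
    return T_delta @ T_grasp_demo

# Example: the object was shifted 5 cm along +x after the demonstration.
T_obj_demo = np.eye(4)
T_obj_now = np.eye(4); T_obj_now[0, 3] = 0.05
T_grasp_demo = np.eye(4); T_grasp_demo[:3, 3] = [0.30, 0.10, 0.20]
print(retarget_pregrasp(T_obj_demo, T_obj_now, T_grasp_demo)[:3, 3])  # -> [0.35 0.1 0.2]
```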
Below are examples of dynamic interference during the pre-grasping stage for the tasks unscrew bottle (top row) and pour water (bottom row), where each object is manually disturbed one, two, or three times. The red arrow indicates the direction of the manually moved object (interfering). The cyan and yellow arrows indicate the movement directions of the left and right robotic arms (chasing), respectively.
Unscrew Bottle
Pour Water
▶ ② Two Long-Horizon Tool-Using Bimanual Tasks
Learning to use both arms to exploit human-made tools (such as using a spoon to scoop water out of a bowl, or using a funnel to pour water from a mug back into a bottle) to tackle more challenging tasks and skills is important, difficult, and interesting. Such tasks often require a combination of multiple steps, making them essentially long-horizon dual-arm manipulation. Below are illustrations of two additional tool-using tasks. Note that a new camera viewpoint is used here to record the hand movements, which does not affect the extraction and injection of hand trajectories.
Tool: Spoon
Tool: Funnel
Moreover, we selected a typical super-long-horizon bimanual task (snack making) and enabled the dual-arm robot to learn newly given goals quickly and easily through one-shot human teaching. Due to space limitations, we did not continue with demonstration proliferation and policy training. Illustrations of the extracted actions that can be injected into real robots are shown below. These results further reveal the simplicity, versatility, and scalability of YOTO.
Stage 1: unscrewing bottle + pouring water
Stage 2: scooping peanut + dumping plum
Stage 3: unfolding cloth + stirring liquid
Stage 4: bi-holding bowl + handover bowl
▶ ③ Three New Bimanual Manipulation Tasks
In addition to the five bimanual manipulation tasks described in the conference paper (pull drawer, pour water, unscrew bottle, uncover lid, and open box), we add three new dual-arm atomic skills: insert pen, reorient board, and flip basket. Below are illustrations of them.
Insert Pen
Reorient Board
Flip Basket
Here we continue to show real-robot rollouts of the three newly added bimanual tasks on different new instances. Note that a new camera viewpoint is used to record the videos of the reorient board task, owing to unexpected circumstances during subsequent transportation and hardware updates.
Insert Pen (new instances such as paired spoons / forks)
Reorient Board (new instances such as spoons / shovels)
Flip Basket (new instances such as white basket / gray pillow)
▶ ④ Transferring YOTO to a Humanoid Dual-Arm Robot
Our YOTO is hardware-agnostic by design. Since human-demonstrated dual-hand trajectories are extracted and encoded in a robot-agnostic space, they can be injected into any dual-arm robotic system as long as the actions remain within its reachable workspace. To validate this, we deploy YOTO on a structurally different humanoid dual-arm robot, which features an anthropomorphic layout more common in general-purpose platforms.
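As a rough illustration of this hardware-agnostic injection (a sketch under assumed conventions, not the exact interface used in our system), transferring demonstrated waypoints to a new robot mainly amounts to applying that robot's camera-to-base extrinsics and checking that the resulting poses stay within its workspace.

```python
# Illustrative sketch only: re-express camera-frame waypoints in a new robot's
# base frame and reject trajectories that leave a coarse workspace bound.
import numpy as np

def transfer_waypoints(waypoints_cam, T_base_cam, reach_radius=0.85):
    """waypoints_cam: (N, 4, 4) end-effector poses in the camera frame.
    T_base_cam: 4x4 extrinsic of the new robot (camera pose in its base frame).
    reach_radius: coarse reachability bound in meters (assumed, robot-specific)."""
    waypoints_base = T_base_cam @ waypoints_cam  # broadcasts over the N waypoints
    reachable = np.linalg.norm(waypoints_base[:, :3, 3], axis=1) < reach_radius
    if not reachable.all():
        raise ValueError(f"{(~reachable).sum()} waypoints fall outside the workspace")
    return waypoints_base

# Example: three identity waypoints seen from a camera mounted 0.6 m in front of the base.
T_base_cam = np.eye(4); T_base_cam[0, 3] = 0.6
waypoints_cam = np.repeat(np.eye(4)[None], 3, axis=0)
print(transfer_waypoints(waypoints_cam, T_base_cam)[:, :3, 3])
```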
Below are illustrations of two selected bimanual tasks (unscrew bottle and pour water) transferred to the humanoid robot. Top row: the visualization of hand motion extraction. Bottom row: the corresponding rollout examples obtained by injecting actions on real robots.
Unscrew Bottle (Humanoid)
Pour Water (Humanoid)
Furthermore, in scenarios where object variation is limited (i.e., intra-instance consistency), we observe that the proposed visual alignment module remains effective. As shown below, YOTO can still achieve training-free closed-loop pre-grasping, followed by direct replay of the demonstrated action sequence, completing the task without additional adaptation. These results provide strong empirical evidence for its cross-embodiment generality and practical deployability across diverse dual-arm robotic systems.
Unscrew Bottle / Pour Water (Closed-Loop)
Citation
Acknowledgements
We acknowledge the providers of various hardware used in this project, including the Aubo-i5 robotic arm, Estun ER7 robotic arm, DH gripper PGI-80-80, Jodell RG75, and kingfisher binocular camera.
The website template was borrowed from Jon Barron and Zip-NeRF.