One-Shot Real-World Demonstration Synthesis for Scalable Bimanual Manipulation
- Huayi Zhou The Chinese University of Hong Kong, Shenzhen
- Kui Jia The Chinese University of Hong Kong, Shenzhen; DexForce, Shenzhen
Abstract
Learning dexterous bimanual manipulation policies critically depends on large-scale, high-quality demonstrations, yet current paradigms face inherent trade-offs: teleoperation provides physically grounded data but is prohibitively labor-intensive, while simulation-based synthesis scales efficiently but suffers from sim-to-real gaps. We present BiDemoSyn, a framework that synthesizes contact-rich, physically feasible bimanual demonstrations from a single real-world example. The key idea is to decompose tasks into invariant coordination blocks and variable, object-dependent adjustments, then adapt them through vision-guided alignment and lightweight trajectory optimization. This enables the generation of thousands of diverse and feasible demonstrations within several hours, without repeated teleoperation or reliance on imperfect simulation. Across six dual-arm tasks, we show that policies trained on BiDemoSyn data generalize robustly to novel object poses and shapes, significantly outperforming recent baselines. By bridging the gap between efficiency and real-world fidelity, BiDemoSyn provides a scalable path toward practical imitation learning for complex bimanual manipulation without compromising physical grounding.
▶ Framework and Overview of BiDemoSyn
The contributions of the proposed BiDemoSyn are threefold: (1) One-Shot Synthesis Framework: A systematic pipeline combining task decomposition, vision-guided adaptation, and contact-aware trajectory optimization to generate scalable real-world bimanual demonstrations. (2) Reality-Grounded Data Generation: A completely simulator-free method for synthesizing bimanual demonstrations, ensuring physical fidelity by construction. (3) Empirical Validation in Complex Tasks: Comprehensive real-robot experiments demonstrating significant improvements in policy robustness and cross-configuration generalization on various bimanual manipulation tasks.
From One to Many ①⇢Ⓝ. Taking the dual-arm coordinated pouring task as an example, we illustrate how corresponding pre-grasping and lifting trajectories are synthesized for new placements and novel instances of the manipulated objects during the initial frame alignment stage.
The overview of BiDemoSyn. It consists of three stages (i.e., deconstruction, alignment, and optimization) based on a given demonstration. We can then apply our method to complete data collection efficiently and conveniently in the real world. It is best to zoom in to view the details.
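To make the three-stage flow concrete, below is a minimal Python sketch of how one synthesis pass could be organized. The data structures and helper behavior (segment splitting, rigid re-anchoring, pass-through refinement) are illustrative assumptions for exposition, not the released BiDemoSyn implementation.

```python
# Illustrative sketch of the three-stage synthesis pass (not the actual API).
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Segment:
    keyposes: List[np.ndarray]   # dual-arm end-effector keyposes (4x4 transforms)
    object_dependent: bool       # True if the segment must follow the object

def deconstruct(demo: List[Segment]) -> Tuple[List[Segment], List[Segment]]:
    """Stage 1: split the demo into invariant coordination blocks and variable segments."""
    invariant = [s for s in demo if not s.object_dependent]
    variable = [s for s in demo if s.object_dependent]
    return invariant, variable

def align(variable: List[Segment], delta_pose: np.ndarray) -> List[Segment]:
    """Stage 2: re-anchor object-dependent keyposes with the vision-estimated
    pose change of the manipulated object (a single rigid transform here)."""
    return [Segment([delta_pose @ kp for kp in s.keyposes], True) for s in variable]

def optimize(invariant: List[Segment], aligned: List[Segment]) -> List[Segment]:
    """Stage 3: lightweight contact-aware refinement (pass-through placeholder)."""
    return invariant + aligned
```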
Illustrations of the initial frame alignment stage applied to the pouring (left and middle) and reorient (right) tasks. They show that the grasp pose is automatically adjusted after the position, orientation, and shape of the manipulated object change.
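A concrete instance of this adjustment, under the assumption that grasps and object poses are represented as 4x4 homogeneous transforms and the new object pose comes from vision-based estimation (function and variable names are ours, not BiDemoSyn's):

```python
# Sketch: re-anchoring a demonstrated grasp pose to a new object pose.
import numpy as np

def adapt_grasp_pose(T_world_obj_demo: np.ndarray,
                     T_world_grasp_demo: np.ndarray,
                     T_world_obj_new: np.ndarray) -> np.ndarray:
    """Keep the grasp fixed in the object frame while the object moves."""
    # Grasp expressed relative to the object in the one-shot demonstration.
    T_obj_grasp = np.linalg.inv(T_world_obj_demo) @ T_world_grasp_demo
    # Re-anchor it to the object's new pose estimated from the current image.
    return T_world_obj_new @ T_obj_grasp
```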
▶ ① Implementation of Data Collection, Processing and Synthesis
Left-Top: The fixed-base dual-arm manipulator platform (a table with two robot arms, two grippers, and a binocular camera) used in this research. Left-Bottom: Object assets involved in all six bimanual manipulation tasks; each object is scaled down proportionally. Right: The grid-cell division scheme for each task. For tasks involving two manipulated objects, the total number of grid cells is split equally between the left and right sides.
Per-task example galleries: task 1: plugpen; task 2: inserting; task 3: unscrew; task 4: pouring; task 5: pressing; task 6: reorient.
UI and real-world data collection examples. We developed a convenient UI for quickly collecting diverse observation images. In the interface, we can check whether a grid cell has been sampled for the corresponding object. During collection, the six tasks display different grid cells in real time while perceiving and tracking the task-relevant objects. Note that these points, crosses, and lines are drawn digitally; they are not physical marks in the real world. All related videos are sped up only 2x, which still reflects the high collection efficiency.
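The grid-cell bookkeeping behind such a UI could look like the following minimal sketch; the workspace ranges, cell counts, and class/method names are placeholders, not the actual interface code.

```python
# Sketch: tracking which workspace grid cells have already been sampled.
import numpy as np

class GridSampler:
    def __init__(self, x_range, y_range, n_rows, n_cols):
        self.xs = np.linspace(*x_range, n_cols + 1)   # cell boundaries along x
        self.ys = np.linspace(*y_range, n_rows + 1)   # cell boundaries along y
        self.sampled = np.zeros((n_rows, n_cols), dtype=bool)

    def mark(self, x, y):
        """Mark the cell containing an observed object placement as sampled."""
        col = np.searchsorted(self.xs, x) - 1
        row = np.searchsorted(self.ys, y) - 1
        if 0 <= row < self.sampled.shape[0] and 0 <= col < self.sampled.shape[1]:
            self.sampled[row, col] = True

    def remaining(self):
        """Indices of cells that still need an object placement."""
        return np.argwhere(~self.sampled)

# Example usage with made-up table dimensions (meters):
sampler = GridSampler((0.2, 0.6), (-0.3, 0.3), n_rows=4, n_cols=6)
sampler.mark(0.35, 0.05)
print(len(sampler.remaining()), "cells left to cover")
```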
From One to Many ①⇢Ⓝ. Left: Six representative bimanual manipulation tasks with their one-shot demonstrations and task-specific descriptors. Right: Real-world data collection diagrams, showing object instances with varied geometries and spatial arrangements used to synthesize diverse demonstrations (e.g., thousands of physically consistent trajectories per task).
▶ ② Training, Deployment and Evaluation of Imitation Policies
We adapt three advanced visuomotor policies, Diffusion Policy (DP), 3D Diffusion Policy (DP3), and EquiBot, to bimanual settings by extending their modeling spaces to dual-arm actions. Observation inputs are RGB-only images or segmented 3D point clouds of task-relevant objects obtained with Florence2 + SAM2, and policies perform open-loop discrete keypose prediction to align with our synthesized demonstration format.
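As a rough sketch of what open-loop keypose execution means in practice (the `policy` and `robot` interfaces below are placeholders, not our actual training or deployment code):

```python
# Sketch of open-loop discrete keypose execution with placeholder interfaces.
def rollout_open_loop(policy, robot, observation):
    """Predict a short sequence of dual-arm keyposes from one observation,
    then execute them without re-observing or re-planning between steps."""
    keyposes = policy.predict(observation)  # [(left_pose, right_pose, grippers), ...]
    for left_pose, right_pose, grippers in keyposes:
        robot.move_arms(left_pose, right_pose)  # synchronized dual-arm motion
        robot.set_grippers(grippers)            # per-arm open/close commands
```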
The compared baselines fall into two categories. The first is purely for data collection, including point-cloud-editing DemoGen, real-robot auto-rollout YOTO, and human drag teaching (close to teleoperation); in this order, the collected demonstrations are of increasingly higher quality, but at an increasingly higher time cost. The other category targets bimanual manipulation directly without retraining, including the zero-shot ReKep, an enhanced ReKep+ with oracle-level grasp labels at the start, and the one-shot ODIL.
Our experiments address three core questions. Q1: Is BiDemoSyn truly efficient and user-friendly? Q2: Does BiDemoSyn enable scalable visuomotor imitation learning? Q3: Do synthetic demonstrations generalize to spatial and object variations?
The above results answer the three questions raised earlier, demonstrating that BiDemoSyn can efficiently synthesize high-quality real-world demonstrations, enabling scalable and generalizable visuomotor policy training with minimal human input.
A1: BiDemoSyn offers clear advantages in efficiency and usability over the baselines.
A2: Demonstrations obtained via BiDemoSyn can support scalable imitation learning.
A3: Policies trained on BiDemoSyn data can achieve generalization to unseen variations.
▶ ③ Visualization and Analysis of Real Robot Rollouts
Above is a visualization of all six bimanual tasks performed on real robots. All models are trained and tested under in-distribution (ID) evaluations, with EquiBot as the visuomotor policy. Key dual-arm coordination movements of each task are partially enlarged for quick review.
Here we supplement more qualitative videos of real-robot rollouts. The blue rectangles in each video mark the given one-shot demonstrations, and the red rectangles mark results under novel configurations (e.g., new placements or instances). All related videos have been sped up 2x for faster viewing. These results underscore BiDemoSyn's ability to handle real-world complexities (such as mechanical tolerances and imperfect perception), while also exposing limitations in dynamic force modulation (e.g., over-pressing bottles or lids).
While policies trained with BiDemoSyn achieve high success rates, failure analysis reveals systematic challenges. Although each trained policy is executed end-to-end (it learns an implicit mapping from observations to action outputs), we can still localize each failure case to the stage in which it occurs and identify the root cause. Using the experimental results of EquiBot under out-of-distribution evaluations, we categorize failures into five types according to the task execution logic (mainly following the design modules of BiDemoSyn):
As can be seen, orientation estimation and initial grasp failures dominate, reflecting two core challenges: (1) current pose estimators struggle with symmetric or textureless objects (e.g., a metal spoon or shovel), and (2) gripper-centric path planning lacks fine-grained contact modeling (e.g., avoiding pre-touch collisions with irregular shapes). Addressing these requires advances in category-agnostic pose estimation and short-horizon contact optimization, which are critical directions for our future work.
Citation
Acknowledgements
We acknowledge the providers of various hardware used in this project, including the Aubo-i5 robotic arm, DH gripper PGI-80-80, and kingfisher binocular camera.
The website template was borrowed from Jon Barron and Zip-NeRF.