VLBiMan: Vision-Language Anchored One-Shot Demonstration Enables Generalizable Bimanual Robotic Manipulation

  • Huayi Zhou
    The Chinese University of Hong Kong, Shenzhen
  • Kui Jia
    The Chinese University of Hong Kong, Shenzhen; DexForce, Shenzhen

Abstract

Achieving generalizable bimanual manipulation requires systems that can learn efficiently from minimal human input while adapting to real-world uncertainties and diverse embodiments. Existing approaches face a dilemma: imitation policy learning demands extensive demonstrations to cover task variations, while modular methods often lack flexibility in dynamic scenes. We introduce VLBiMan, a framework that derives reusable skills from a single human example through task-aware decomposition, preserving invariant primitives as anchors while dynamically adapting adjustable components via vision-language grounding. This adaptation mechanism resolves scene ambiguities caused by background changes, object repositioning, or visual clutter without policy retraining, leveraging semantic parsing and geometric feasibility constraints. Moreover, the system inherits human-like hybrid control capabilities, enabling mixed synchronous and asynchronous use of both arms. Extensive experiments validate VLBiMan across tool-use and multi-object tasks, demonstrating: (1) a drastic reduction in demonstration requirements compared to imitation baselines, (2) compositional generalization through atomic skill splicing for long-horizon tasks, (3) robustness to novel but semantically similar objects and external disturbances, and (4) strong cross-embodiment transfer, showing that skills learned from human demonstrations can be instantiated on different robotic platforms without retraining. By bridging human priors with vision-language anchored adaptation, our work takes a step toward practical and versatile dual-arm manipulation in unstructured settings.

▶ Overview and Framework of VLBiMan

Our contributions in this research are threefold: (1) We propose VLBiMan, a novel framework that enables generalizable bimanual manipulation through one-shot demonstration and vision-language anchoring, without retraining. (2) We introduce a task-aware motion decomposition and adaptation mechanism, which reuses invariant sub-skills via object-centric anchors from VLMs and supports cross-embodiment transfer from human demonstrations to different robotic embodiments. (3) We validate VLBiMan on ten diverse bimanual tasks, showing superior generalization, sample efficiency, and robustness compared to strong baselines.


Vision-Language Anchored Bimanual Manipulation (VLBiMan). Left: Taking pouring water as an example, we sketch the entire VLBiMan pipeline built on a one-shot demonstration. Right: VLBiMan achieves generalizable bimanual manipulation on a variety of complex, contact-rich tasks without retraining, robustly coping with diverse scenarios.


Detailed framework of VLBiMan. Taking pouring water as an example, the paradigm consists of three stages (decomposition, adaptation, and composition) applied to a given demonstration. VLBiMan generalizes to unseen spatial placements and new category-level instances within the same task.
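For intuition, the sketch below shows one hypothetical way a decomposed demonstration could be represented, with invariant segments kept as anchors and adjustable segments marked for vision-language re-grounding. All names, fields, and shapes are illustrative assumptions, not the released implementation.

# Hypothetical data-structure sketch of a decomposed one-shot demonstration
# (illustrative only; field names and shapes are assumptions, not the released code).
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Segment:
    arm: str                  # "left" or "right"
    waypoints: np.ndarray     # (T, 6) end-effector poses, e.g. xyz + rpy
    invariant: bool           # True: reuse verbatim as an anchor primitive
    anchor_object: str = ""   # object the segment is expressed relative to

@dataclass
class DecomposedDemo:
    segments: List[Segment] = field(default_factory=list)

    def anchors(self) -> List[Segment]:
        # Invariant primitives preserved across scenes (output of decomposition).
        return [s for s in self.segments if s.invariant]

    def adjustable(self) -> List[Segment]:
        # Segments to be re-grounded via vision-language anchoring (input to adaptation).
        return [s for s in self.segments if not s.invariant]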


Illustrations of representative points for manipulated objects in three tasks: pouring (left), reorient+unscrew (middle), and tool-use spoon (right). These points are used to compute the change in object position and orientation (not required for every task).
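As an illustrative sketch of how such representative points could be used, a standard SVD-based rigid fit recovers the change in object position and orientation from matched points in the demonstration and the current scene. This is a minimal sketch of the general technique, not necessarily the paper's exact procedure.

# Illustrative sketch: estimate the rigid change in object pose from matched
# representative points (standard Kabsch/SVD fit; not necessarily the paper's exact step).
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Find R, t such that dst ~= src @ R.T + t.

    src, dst: (N, D) arrays of matched representative points (D = 2 or 3).
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t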

▶ Tasks, Implementation Details and Experimental Results


Top-Right: The fixed-base dual-arm manipulation platform used in this research (a table with two robot arms, two grippers, and a binocular camera). Left and Bottom: Manipulated object assets for the six primary-skill bimanual manipulation tasks (plugpen, inserting, unscrew, pouring, pressing, and reorient) and the four long-horizon bimanual manipulation tasks (reorient+unscrew, unscrew+pouring, tool-use spoon, and tool-use funnel). Each object is shown scaled down proportionally.


Image-Moments-Based Orientation Estimation. In the Vision-Language Anchored Adaptation pipeline of VLBiMan, our method must extract the principal axis and determine the orientation of direction-sensitive objects. These include the marker pen in the plugpen and inserting tasks, the spoon in the reorient and tool-use spoon tasks, and the horizontally placed bottle in the reorient+unscrew task. As shown in Alg. 1, we adopt a principal-axis extraction algorithm based on image-moment theory. Since this algorithm relies only on the object's 2D segmentation mask and does not require any deep networks, its computational overhead is minimal and can be considered negligible in practice.
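A minimal sketch of this computation is given below, assuming a binary 2D segmentation mask and using only NumPy and OpenCV; it illustrates the standard image-moment formula rather than transcribing Alg. 1 verbatim.

# Minimal sketch: principal-axis orientation from a binary 2D segmentation mask
# using image moments (illustrative; not a verbatim transcription of Alg. 1).
import numpy as np
import cv2

def principal_axis_from_mask(mask: np.ndarray):
    """Return (centroid_xy, angle_rad) of the object's principal axis.

    mask: HxW binary array (nonzero pixels belong to the object).
    """
    m = cv2.moments(mask.astype(np.uint8), binaryImage=True)
    if m["m00"] == 0:
        raise ValueError("Empty mask: no object pixels found.")
    # Centroid from raw moments.
    cx, cy = m["m10"] / m["m00"], m["m01"] / m["m00"]
    # Principal-axis angle from second-order central moments.
    theta = 0.5 * np.arctan2(2.0 * m["mu11"], m["mu20"] - m["mu02"])
    # Note: theta is defined only up to a 180-degree ambiguity; direction-sensitive
    # objects (e.g., a marker pen) need an extra disambiguation step as in the paper.
    return (cx, cy), theta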


Baselines and Metric. For each task setting, we conduct 20 trials in which objects are randomly relocated or replaced, and report the success rate. For baselines, we compare against ReKep, which builds on VFMs (SAM and DINOv2) and GPT-4o, and Robot-ABC, which predicts keypoint affordances and uses AnyGrasp for the initial grasp (the remaining trajectory is then obtained by a straightforward combination of modules). In addition, for a fairer comparison, we introduce an enhanced variant, ReKep+, in which an oracle-level initial grasp label is injected to mitigate the impact of noisy perception. We also adapt two one-shot single-arm manipulation methods, Mechanisms and MAGIC, to our dual-arm tasks.

In experiments, we aim to answer the following research questions: (1) How well does our framework automatically formulate and synthesize bimanual manipulation behaviors? (2) Can our method generalize to novel scenarios and effectively combine skills? (3) How do individual components contribute to the effectiveness and robustness of our system?


The above results answer the three research questions raised earlier, demonstrating that VLBiMan can efficiently compose executable bimanual trajectories under diverse scene variations. Without relying on object-specific priors or pose annotations, VLBiMan achieves robust generalization across unseen object instances and layouts.

▶ Visualization and Video Records of Real Robot Rollouts


Above is a visualization of all ten tasks executed on real robots. They are designed to validate different aspects: the six dual-arm primary skills, the combination of basic skills in two long-horizon tasks, and the exploration of multi-stage spatiotemporal dependencies in two tool-use tasks.


Video collection of real robot deployments (with dynamic human interference). The central area of each video shows the objects manipulated in that task. The ten tasks are plugpen, inserting, unscrew, pouring, pressing, reorient, reorient+unscrew, unscrew+pouring, tool-use spoon, and tool-use funnel.

plugpen inserting
unscrew pouring
pressing reorient
reorient+unscrew unscrew+pouring
tool-use spoon tool-use funnel

Good robustness of VLBiMan to lighting changes. The video comparison below shows the impact of uneven lighting on the six basic bimanual tasks (still with human interference). As can be seen, VLBiMan combined with VLMs remains robust under uneven lighting.

Even Light Results Uneven Light Results

Efficient synchronous dual-arm movement. Below is a comparison of the time consumption for asynchronous versus synchronous execution of all ten tasks. We also compile a dynamic comparison chart of all ten tasks on a single page for quick viewing. Coordinating and synchronizing the two arms clearly improves manipulation efficiency: we observe time savings of varying magnitudes across all ten tasks, yielding an average improvement in execution efficiency of approximately 22%.

plugpen inserting unscrew pouring
pressing reorient reorient+unscrew unscrew+pouring
tool-use spoon tool-use funnel
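For the timing comparison above, one straightforward aggregation is to average the per-task relative saving of synchronous over asynchronous execution; the sketch below is illustrative, and the exact timings and aggregation used in the paper may differ.

# Minimal sketch: average relative time saving of synchronous over asynchronous
# execution (illustrative aggregation; actual per-task timings come from the experiments).
from typing import Dict

def average_time_saving(async_s: Dict[str, float], sync_s: Dict[str, float]) -> float:
    """Return the mean per-task relative saving, e.g. 0.22 for roughly 22%."""
    savings = [(async_s[t] - sync_s[t]) / async_s[t] for t in async_s]
    return sum(savings) / len(savings)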

Cross-Embodiment Transferability of VLBiMan. To further assess the generalization ability of VLBiMan, we investigate its cross-embodiment transferability. Specifically, we evaluate how a one-shot demonstration collected from a human demonstrator can be transferred to a robotic embodiment with different kinematic and actuation constraints. We report both qualitative visualizations and quantitative results, focusing on four representative bimanual tasks: inserting, unscrew, pouring, and reorient.


Citation

Acknowledgements

We acknowledge the providers of various hardware used in this project, including the Aubo-i5 robotic arm, Rokae xMate CR7 robotic arm, DH gripper PGI-80-80, Jodell Robotics RG75-300, and kingfisher binocular camera.

The website template was borrowed from Jon Barron and Zip-NeRF.