BiNoMaP: Learning Category-Level Bimanual Non-Prehensile Manipulation Primitives

  • Huayi Zhou
    The Chinese University of Hong Kong, Shenzhen
  • Kui Jia
    The Chinese University of Hong Kong, Shenzhen; DexForce, Shenzhen

Abstract

Non-prehensile manipulation, encompassing grasp-free actions such as pushing, poking, and pivoting, represents a critical yet underexplored domain in robotics due to its contact-rich and analytically intractable nature. In this work, we revisit this problem from two novel perspectives. First, we move beyond the usual single-arm setup and the strong assumption of favorable extrinsic dexterity such as walls, ramps, or edges. Instead, we advocate a generalizable dual-arm configuration and establish a suite of Bimanual Non-prehensile Manipulation Primitives (BiNoMaP). Second, we depart from the prevailing RL-based paradigm and propose a three-stage, RL-free framework for learning non-prehensile skills. Specifically, we begin by extracting bimanual hand motion trajectories from video demonstrations. Owing to visual inaccuracies and morphological gaps, these coarse trajectories are difficult to transfer directly to robotic end-effectors. To address this, we propose a geometry-aware post-optimization algorithm that refines the raw motions into executable manipulation primitives conforming to specific motion patterns. Beyond instance-level reproduction, we further enable category-level generalization by parameterizing the learned primitives with object-relevant geometric attributes, particularly size, resulting in adaptable, parameterized manipulation primitives. We validate BiNoMaP across a range of representative bimanual tasks and diverse object categories, demonstrating its effectiveness, efficiency, versatility, and superior generalization capability.

▶ Overview and Framework of BiNoMaP

Our contributions: (1) We propose the first RL-free framework for learning Bimanual Non-Prehensile Manipulation Primitives directly from human video demonstrations. (2) We introduce a parameterization scheme that enables category-level generalization of non-prehensile skills across diverse object instances. (3) We demonstrate the effectiveness, efficiency, versatility, and generality of BiNoMaP across a variety of tasks and objects, and against strong baselines.


Bimanual Non-Prehensile Manipulation Primitives (BiNoMaP). (Left) We extract coarse hand trajectories of non-prehensile skills from human video demonstrations, then refine and optimize them for execution on a dual-arm robot. The reproduced skills can be further parameterized from the instance level to the category level. (Right) We extensively validate BiNoMaP on four skills (poking, pivoting, pushing, and wrapping) involving a variety of objects.


BiNoMaP framework. (1) The first stage leverages strong priors from hand demonstrations to obtain coarse dual-arm trajectories for non-prehensile tasks. (2) The second stage refines these trajectories to mitigate multi-source noise and improve execution stability. (3) The final stage generalizes learned skills to novel objects within the same category by parameterizing primitives.

▶ Implementation Details of the Three-Stage BiNoMaP

Hardware: a fixed-base dual-arm manipulator platform and a Kingfisher R-6000 binocular camera. Tasks: four representative skills, namely poking, pivoting, pushing, and wrapping.


Object Assets: the objects involved in our four selected non-prehensile skills and eight bimanual manipulation tasks. All objects have been scaled down proportionally.


(Stage 1) Bimanual Hand Trajectory Extraction: 3D hand reconstruction from videos & hand-to-robot trajectory extraction.

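To make Stage 1 concrete, here is a minimal Python sketch of the extraction recipe, under stated assumptions: an off-the-shelf 3D hand reconstructor is wrapped in a hypothetical function estimate_wrist_poses that returns per-frame 4x4 wrist poses in the camera frame, and the camera-to-base extrinsics T_cam_to_base and a fixed wrist-to-end-effector offset T_wrist_to_ee are calibrated offline. This illustrates the general idea, not the paper's exact implementation.

import numpy as np

def extract_bimanual_trajectories(frames, estimate_wrist_poses):
    # Run the (assumed) hand reconstructor on every frame and collect 4x4
    # wrist poses for both hands; frames without a detection are skipped.
    left, right = [], []
    for frame in frames:
        poses = estimate_wrist_poses(frame)  # {"left": 4x4 or None, "right": ...}
        if poses.get("left") is not None:
            left.append(poses["left"])
        if poses.get("right") is not None:
            right.append(poses["right"])
    return np.stack(left), np.stack(right)

def hands_to_end_effectors(wrist_poses, T_cam_to_base, T_wrist_to_ee):
    # Re-express each wrist pose in the robot base frame, then apply a
    # fixed wrist-to-end-effector offset to bridge the human-robot
    # morphology gap: T_base_ee = T_cam_to_base @ T_cam_wrist @ T_wrist_ee.
    return np.einsum("ij,njk,kl->nil", T_cam_to_base, wrist_poses, T_wrist_to_ee)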

(Stage 2) Coarse-to-Fine Motion Post-Optimization: motion smoothness optimization & geometry-aware iterative contact adjustment.

Example 1: pivoting a bowl. Example 2: wrapping a basket.
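The post-optimization of Stage 2 can be sketched as two sub-steps, shown below: low-pass smoothing of the noisy trajectory (here a Savitzky-Golay filter, one plausible choice) and an iterative adjustment that pushes penetrating waypoints back to a small clearance from the object surface. The object geometry is assumed to be available as a signed distance function sdf (negative inside the object); this interface and the update rule are illustrative assumptions, not the paper's exact algorithm.

import numpy as np
from scipy.signal import savgol_filter

def smooth_trajectory(positions, window=11, polyorder=3):
    # positions: (N, 3) end-effector waypoints; filter each axis over time.
    return savgol_filter(positions, window, polyorder, axis=0)

def surface_normals(sdf, pts, eps=1e-4):
    # Approximate outward surface normals by central finite differences.
    g = np.zeros_like(pts)
    for k in range(3):
        offs = np.zeros(3)
        offs[k] = eps
        g[:, k] = (sdf(pts + offs) - sdf(pts - offs)) / (2.0 * eps)
    return g / np.clip(np.linalg.norm(g, axis=1, keepdims=True), 1e-9, None)

def adjust_contacts(positions, sdf, clearance=0.002, iters=20):
    # Iteratively move violating waypoints along the local surface normal
    # until contact is grazing rather than penetrating.
    pts = positions.copy()
    for _ in range(iters):
        d = sdf(pts)                # (N,) signed distances to the object
        viol = d < clearance
        if not viol.any():
            break
        n = surface_normals(sdf, pts[viol])
        pts[viol] += (clearance - d[viol])[:, None] * n
    return pts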

(Stage 3) Category-Level Primitive Parameterization. BiNoMaP allows a learned skill to adapt to other objects of the same category. Below is an example of achieving category-level generalization for pivoting a bowl.
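As a minimal sketch of one plausible parameterization, the instance-level waypoints can be rescaled about the object center by the ratio of a characteristic dimension (e.g., bowl diameter), which we take here as the size attribute. The helper below and its arguments are illustrative assumptions; BiNoMaP's parameterization may use richer object-relevant geometric attributes.

import numpy as np

def retarget_primitive(waypoints, demo_center, demo_size, new_center, new_size):
    # waypoints: (N, 3) positions of the instance-level primitive.
    # Rescale the motion about the demo object's center by the size ratio,
    # then translate it to the new object's center.
    scale = new_size / demo_size
    return new_center + scale * (waypoints - demo_center)

# Usage: a pivot learned on a 15 cm bowl, replayed on a 20 cm bowl.
# new_traj = retarget_primitive(traj, np.array([0.45, 0.0, 0.02]), 0.15,
#                               np.array([0.50, 0.05, 0.02]), 0.20)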

▶ Visualization and Video Records of Real Robot Rollouts

Here are examples of the four non-prehensile skills instantiated with different tasks and diverse objects. More results of real-robot rollouts can be found in the videos below.


To better understand the practical performance of BiNoMaP, we collected and summarized representative failure cases observed in real-robot experiments across the four skills and eight tasks.


Here are examples of cross-embodiment transfer of BiNoMaP to a humanoid dual-arm robot. We transferred two learned skills: pivoting a bowl and wrapping a basket.


Here are examples showing the compositionality of skills learned via BiNoMaP in three downstream applications: pre-grasping, rearrangement, and error recovery.


Citation

Acknowledgements

We acknowledge the providers of the various hardware used in this project, including the Aubo-i5 robotic arm, the Rokae xMate CR7 robotic arm, the DH PGI-80-80 gripper, the Jodell Robotics RG75-300, and the Kingfisher binocular camera.

The website template was borrowed from Jon Barron and Zip-NeRF.