Dexterity-BEV: Aligning 3D World and Actions for Generalizable Robot Policies Learning
- Huayi Zhou$^{*\;1,2}$, Wei Gao$^{*\;1}$, Dekun Lu$^{1}$, Ruiji Liu$^{1}$, Zhanqi Zhang$^{1}$, Ziyang Zhang$^{1}$, Jian Chen$^{1}$, Wenlve Zhou$^{1}$, Sheng Xu$^{2}$, Shumin Li$^{1}$, Kangyi Guo$^{1}$, Shichen Xu$^{1}$, Zixin Huang$^{1}$, Yongyi Su$^{1}$, Kui Jia$^{\ddagger\;1,2}$ $^{1}$DexForce Technology $^{2}$The Chinese University of Hong Kong, Shenzhen $^*$Equal Contribution $^{\ddagger}$Corresponding Author
Abstract
End-to-end manipulation policies, combined with web-scale pretrained Vision-Language Models (VLMs), show promise for generalizable and dexterous robotic manipulation. However, they inherit two key limitations from 2D foundation models: 1) a reliance on 2D RGB inputs that ignores the intrinsically 3D nature of manipulation; and 2) a lack of 3D spatial alignment between input and output spaces, as well as across diverse robot embodiments, camera setups, and trajectory datasets. In this paper, we present a series of contributions to address these issues. First, we introduce the aligned vertex map and vertex spectrum, a pixel-wise 3D representation that lifts 2D visual inputs to 3D using camera calibration and optional depth. This input representation combines 3D awareness with the generalization of large 2D VLMs. Then, we propose to align the inputs and outputs of manipulation policies by expressing the per-pixel 3D information of each camera view and the robot actions in a shared coordinate frame. Based on this, we designate a canonical Bird's-Eye-View (BEV) alignment frame and propose to construct BEV images, producing a view-invariant representation that is robust to camera pose variations. To enable training and evaluation at scale, we develop a comprehensive data processing pipeline to perform such alignments; we also introduce a novel temporal alignment scheme for trajectories across diverse robots, human operators, and datasets. These contributions collectively mitigate spatial-temporal misalignments between inputs and outputs, improving consistency and generalization for real-world manipulation.
▶ Framework and Overview of Dexterity-BEV (Dex-BEV)
We introduce Dexterity-BEV (Dex-BEV), a series of technical and systematic contributions for manipulation policy learning that generalizes across different embodiments, camera views, and datasets. In particular, we introduce 3D input representations that integrate easily with pretrained 2D VLMs; spatial alignment between multi-view cameras and robot actions; and temporal alignment between trajectories from different robots and/or tele-operators. These concepts lead to a comprehensive data processing pipeline and to trajectory datasets that are aligned spatially and temporally.
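The 3D input representation is described as pixel-wise and built from camera calibration and optional depth. A minimal sketch of that idea, assuming a pinhole camera model and an available depth image, is given here. The function names, the toy intrinsics, and the extrinsic matrix `T` are hypothetical and chosen only for illustration; the vertex spectrum and the depth-free case are not covered. Plain Python lists are used to keep the example dependency-free, whereas a practical pipeline would vectorize this with an array library.

```python
def unproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def transform_point(T, p):
    """Apply a 4x4 homogeneous transform T (nested lists) to a 3D point p."""
    x, y, z = p
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3] for r in range(3))


def vertex_map(depth_img, intrinsics, T_shared_from_cam):
    """Return an H x W grid holding, for every pixel, its 3D point in the shared frame."""
    fx, fy, cx, cy = intrinsics
    return [
        [
            transform_point(T_shared_from_cam, unproject_pixel(u, v, d, fx, fy, cx, cy))
            for u, d in enumerate(row)
        ]
        for v, row in enumerate(depth_img)
    ]


# Toy 2x2 depth image: every pixel is 2 m away from the camera.
depth = [[2.0, 2.0], [2.0, 2.0]]
K = (1.0, 1.0, 0.5, 0.5)  # fx, fy, cx, cy (hypothetical values)
# Hypothetical extrinsics: camera shifted +1 m along x of the shared frame, no rotation.
T = [[1, 0, 0, 1.0], [0, 1, 0, 0.0], [0, 0, 1, 0.0], [0, 0, 0, 1]]
vmap = vertex_map(depth, K, T)
```

Because every pixel now stores coordinates in the same frame regardless of which camera produced it, maps from several views can be compared or merged directly.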
Dexterity-BEV (Dex-BEV) Framework. (a) We propose to construct BEV images and associated vertex maps for invariance to different camera viewpoints. Note that the synthesized BEV images for two vastly different camera poses are very similar to each other, and objects are located at almost identical pixel locations in the BEV images. (b) An overview of the Dex-BEV architecture. Please refer to Sec. 3.3 of the main paper for a detailed explanation.
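One way to picture the BEV construction in (a): once each pixel carries a 3D point in the shared frame, a top-down image can be formed by orthographic projection onto a ground-plane grid. The sketch below is an illustration under that assumption, not the rendering procedure of the paper. The function `rasterize_bev`, the grid ranges, the resolution, and the "keep the highest point per cell" rule are all hypothetical choices.

```python
import math


def rasterize_bev(points, colors, x_range, y_range, resolution):
    """Orthographically project colored 3D points onto a top-down grid.

    For each grid cell, the color of the highest point (largest z) is kept,
    which mimics what a camera looking straight down would see.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    width = int(round((x_max - x_min) / resolution))
    height = int(round((y_max - y_min) / resolution))
    image = [[None] * width for _ in range(height)]
    top_z = [[float("-inf")] * width for _ in range(height)]
    for (x, y, z), color in zip(points, colors):
        col = math.floor((x - x_min) / resolution)
        row = math.floor((y - y_min) / resolution)
        inside = 0 <= col < width and 0 <= row < height
        if inside and z > top_z[row][col]:
            top_z[row][col] = z
            image[row][col] = color
    return image


# Points already expressed in the shared frame (labels stand in for RGB values).
points = [(0.05, 0.05, 0.0), (0.05, 0.05, 0.1), (0.15, 0.05, 0.0)]
colors = ["table", "cube", "table"]
bev = rasterize_bev(points, colors, x_range=(0.0, 0.2), y_range=(0.0, 0.1), resolution=0.1)
```

Since the grid is defined in the shared frame, the same physical object falls into the same cell no matter which camera observed it, which is the property the caption highlights.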
3D spatial alignment in our data processing pipeline. We develop a customized GUI application for 3D alignment and visualization, as explained in Subsec. 3.4 of the main paper. In (a-f), we show the 3D alignment of representative public and internal datasets, including (a) LIBERO, (b) Agibot-World Alpha/Beta, (c) RoboTwin 2.0, (d) RoboMind 2.0, and (e-f) our internal datasets. We also apply a unified TCP frame convention, as shown in these figures.
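Aligning outputs with inputs means that robot actions are expressed in the same frame as the per-pixel 3D points. A minimal sketch, assuming actions are TCP poses given as 4x4 homogeneous transforms in the robot base frame and that the base pose in the shared frame is known from calibration, could look as follows. The helper names and the numeric poses are hypothetical, and the specific TCP axis convention used in the datasets is not modeled here.

```python
def matmul4(A, B):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def action_in_shared_frame(T_shared_from_base, T_base_from_tcp):
    """Re-express a TCP pose given in the robot base frame in the shared frame."""
    return matmul4(T_shared_from_base, T_base_from_tcp)


# Hypothetical calibration: the robot base sits at (0.5, 0, 0) in the shared
# frame and is rotated by 90 degrees about the z axis.
T_shared_from_base = [
    [0, -1, 0, 0.5],
    [1, 0, 0, 0.0],
    [0, 0, 1, 0.0],
    [0, 0, 0, 1],
]
# Hypothetical action: TCP at (0.3, 0, 0.2) in the base frame, no rotation.
T_base_from_tcp = [
    [1, 0, 0, 0.3],
    [0, 1, 0, 0.0],
    [0, 0, 1, 0.2],
    [0, 0, 0, 1],
]
T_shared_from_tcp = action_in_shared_frame(T_shared_from_base, T_base_from_tcp)
tcp_position = [T_shared_from_tcp[r][3] for r in range(3)]
```

With this convention, trajectories recorded on robots with different base placements yield comparable action targets once they are mapped into the shared frame.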
▶ ① The Customized GUI Application for Data Alignment
Left: A workflow demonstration of the customized GUI application for 3D alignment and visualization, using an example from LIBERO. Right: The constructed BEV images and associated vertex maps, designed for invariance to different camera viewpoints, also using an example from LIBERO.
▶ ② Training, Deployment and Evaluation of Dex-BEV
Our evaluation aims to demonstrate that Dex-BEV provides a superior and more interpretable framework for dexterous robotic manipulation than existing 2D and 3D VLA paradigms. We systematically test its efficacy across diverse simulated benchmarks (including LIBERO and RoboTwin 2.0) and four different real-world bimanual robotic platforms, focusing on its spatial reasoning capabilities and cross-embodiment generalization. The main baselines are $\pi_0$ and X-VLA.
Quantitative Results on Simulation Benchmarks. We evaluate Dex-BEV against competitive VLA baselines, including $\pi_0$ and X-VLA, on the official and modified configurations of the LIBERO and RoboTwin 2.0 benchmarks. As summarized in Tab. 1, when a single set of network weights is deployed across vastly different platforms (the single-arm Franka for LIBERO and the dual-arm Agilex for RoboTwin), our framework achieves comparable success rates on LIBERO and significantly outperforms the state-of-the-art baselines on RoboTwin. Furthermore, under the highly challenging mutated setups, where camera viewpoints, robot bases, and scene layouts are subjected to severe 6-DoF random perturbations, standard 2D VLA policies fail, with success rates below 10% as shown in Tab. 2. In sharp contrast, Dex-BEV maintains a stable success rate of 89.9%, which supports the claim that our 3D spatial alignment and view-invariant BEV representations absorb pose variations that are a major bottleneck for conventional 2D architectures.
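The intuition behind robustness to camera perturbations can be illustrated with a toy experiment: a fixed world point looks different in the camera frame under each perturbed pose, yet maps back to the same shared-frame coordinates once the calibrated pose is applied. The sketch below assumes a ZYX Euler parameterization and uniform sampling; the perturbation ranges, the seed, and all function names are hypothetical and do not reproduce the benchmark protocol.

```python
import math
import random


def euler_to_matrix(roll, pitch, yaw):
    """Rotation matrix R = Rz(yaw) * Ry(pitch) * Rx(roll)."""
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]


def random_camera_pose(rng, max_angle, max_shift):
    """Sample a 6-DoF camera pose (R, t) in the world frame."""
    angles = [rng.uniform(-max_angle, max_angle) for _ in range(3)]
    t = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    return euler_to_matrix(*angles), t


def world_to_camera(R, t, p_world):
    """p_cam = R^T (p_world - t) for a camera with pose (R, t) in the world."""
    d = [p_world[i] - t[i] for i in range(3)]
    return [sum(R[k][i] * d[k] for k in range(3)) for i in range(3)]


def camera_to_world(R, t, p_cam):
    """p_world = R p_cam + t, using the calibrated camera pose."""
    return [sum(R[i][k] * p_cam[k] for k in range(3)) + t[i] for i in range(3)]


rng = random.Random(0)  # fixed seed so the example is repeatable
p_world = [0.4, -0.1, 0.05]
observations = []
for _ in range(2):
    R, t = random_camera_pose(rng, max_angle=0.5, max_shift=0.3)
    p_cam = world_to_camera(R, t, p_world)
    observations.append((p_cam, camera_to_world(R, t, p_cam)))
```

A policy that consumes only camera-frame or image-space inputs sees two different observations here, while a policy that consumes shared-frame coordinates sees the same point twice, up to floating-point error.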
Comparative Performance on Real-World Hardware. To demonstrate the embodiment-agnostic utility of Dex-BEV on physical robots, we compare our framework against strong baselines ($\pi_0$ and X-VLA) across four distinct dual-arm hardware setups executing five challenging, long-horizon tasks involving articulated, deformable, and granular objects. The empirical results in Tab. 3 show that Dex-BEV holds a large performance margin over all baselines, achieving the highest success rates across all tasks (e.g., 93.3% for Agilex Fold Cloth, 86.7% for W1 Scoop Popcorn, and 96.7% for A1 Fold Cloth). Crucially, while existing policies exhibit severe performance volatility under unexpected human-in-the-loop disturbances or novel object color/scale variations, Dex-BEV exhibits strong zero-shot out-of-distribution generalization and continuous error self-recovery. This contrast suggests that our framework captures the underlying 3D spatial structure of complex bimanual coordination rather than merely memorizing localized 2D visual correlation patterns.
▶ ③ Visualization and Analysis of Real-World Rollouts
Below are rollouts from four dual-arm platforms performing five real-world bimanual dexterous manipulation tasks.
[Task 1]/[T1] Agilex Bimanual Evaluations: Fold Mailer Box
| [T1] fold one time (ID, speed 2x) | [T1] fold one time (ID, speed 2x) | [T1] fold two times (ID, speed 4x) | [T1] fold two times (ID, speed 4x) |
| [T1] error recovery (OOD, speed 4x) | [T1] error recovery (OOD, speed 4x) | [T1] error recovery (OOD, speed 4x) |
| [T1] fold three consecutive times (ID, speed 4x) | [T1] fold four consecutive times (ID, speed 4x) | [T1] fold five consecutive times (ID, speed 4x) |
[Task 2]/[T2] Agilex Bimanual Evaluations: Fold Cloth
| [T2] seen T-shirt (ID, speed 4x) | [T2] seen T-shirt (ID, speed 4x) | [T2] seen T-shirt (ID, speed 4x) | [T2] seen T-shirt (ID, speed 4x) |
| [T2] unseen colors/sizes (OOD, speed 4x) | [T2] unseen colors/sizes (OOD, speed 4x) | [T2] unseen colors/sizes (OOD, speed 4x) |
[Task 3]/[T3] DexForce W1 Humanoid Evaluations: Scoop Popcorn
| [T3] front view (ID, speed 4x) | [T3] side view (ID, speed 4x) |
| [T3] dynamic interference (OOD, speed 4x) | [T3] dynamic interference (OOD, speed 4x) |
[Task 4]/[T4] DexForce W1 Humanoid Evaluations: Handover Book
| [T4] handover blue book (ID, speed 2x) | [T4] handover brown book (ID, speed 2x) |
| [T4] blue book + interference (OOD, speed 4x) | [T4] brown book + interference (OOD, speed 4x) |
[Task 5]/[T5] DexForce A1 Semi-Humanoid Evaluations: Fold Cloth
| [T5] flat initial state (ID, speed 4x) | [T5] flat initial state (ID, speed 4x) | [T5] flat initial state (ID, speed 4x) | [T5] flat initial state (ID, speed 4x) |
| [T5] chaotic initial state (ID, speed 8x) | [T5] chaotic initial state (ID, speed 8x) | [T5] chaotic initial state (ID, speed 8x) |
Citation
Acknowledgements
This work was funded by the Key-Area Research and Development Program of Guangdong Province, China under Grant 2024B0101040004, and the Shenzhen Science and Technology Program under Grants KJZD20240903104008012 and ZDCY20250901113000001.
Beyond that, this work was supported by the major leadership and directional guidance of Kui Jia. We sincerely thank all the contributors for their dedication: co-first authors Huayi Zhou and Wei Gao conceptualized the framework and drafted the manuscript, with Huayi Zhou conducting Agilex real-world experiments, and Wei Gao leading the simulation benchmarks, real-world deployment, and core data infrastructure; Dekun Lu and Jian Chen assisted with the data infrastructure and hardware testing; Ruiji Liu, Zhanqi Zhang, and Ziyang Zhang managed the real-robot evaluations on the A1 semi-humanoid and W1 humanoid configurations; Wenlve Zhou, Sheng Xu, and Yongyi Su contributed to text polishing and technical discussions; and Shumin Li, Kangyi Guo, Shichen Xu, and Zixin Huang supported the large-scale real-world teleoperation data collection.
The website template was borrowed from Jon Barron and Zip-NeRF.