 
By analyzing the feature similarity between Grad-CAM visualizations and SAM2-identified ground-truth regions, Manipulation Centricity quantifies how strongly a representation focuses on task-relevant areas and predicts downstream manipulation performance.
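To make the metric concrete, here is a minimal sketch of how such a score could be computed for a single frame, assuming a precomputed Grad-CAM heatmap and a binary SAM2 mask are already available; the cosine-similarity measure and the helper name `manipulation_centricity_score` are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

def manipulation_centricity_score(grad_cam: np.ndarray, sam2_mask: np.ndarray) -> float:
    """Illustrative score: cosine similarity between a Grad-CAM heatmap
    and a binary task-relevant mask (e.g., produced by SAM2).

    grad_cam : (H, W) non-negative attention map from Grad-CAM.
    sam2_mask: (H, W) {0, 1} array marking task-relevant regions.
    """
    # Normalize the heatmap to [0, 1] so scores are comparable across frames.
    cam = grad_cam - grad_cam.min()
    cam = cam / (cam.max() + 1e-8)

    a = cam.flatten()
    b = sam2_mask.astype(np.float32).flatten()
    # Higher similarity means the representation attends to the same
    # regions that SAM2 marks as task-relevant.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
```

Averaging such per-frame scores over frames and tasks would give a single number per representation, which is the quantity reported here as predictive of downstream performance.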
 
                       
                       
Grad-CAM visualizations for the Square task from Robomimic and the Pick Place Wall task from MetaWorld.
 
            4 Domains: MetaWorld, DexArt, Robomimic, RoboCasa; 20 Tasks
 
                       
                       
We perform t-SNE visualization on 10 simulation tasks from MetaWorld and 3 real-robot tasks. Each dot represents an image frame, and each color indicates a task. The results demonstrate that (1) our representation exhibits the strongest clustering ability and (2) robot data is helpful for learning robotic representations.
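As an illustration of the visualization described above, the following sketch projects frame embeddings to 2D with t-SNE and colors each point by its task; the function name, perplexity value, and plotting choices are assumptions for demonstration only.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(features: np.ndarray, task_ids: np.ndarray, out_path: str = "tsne.png") -> None:
    """features: (N, D) encoder embeddings of image frames.
    task_ids: (N,) integer task label per frame, used only for coloring."""
    # Project high-dimensional embeddings to 2D for visualization.
    embedded = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)

    plt.figure(figsize=(6, 6))
    # One color per task; tight, well-separated clusters suggest the
    # representation distinguishes tasks well.
    plt.scatter(embedded[:, 0], embedded[:, 1], c=task_ids, cmap="tab20", s=5)
    plt.axis("off")
    plt.savefig(out_path, dpi=200, bbox_inches="tight")
```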
@article{jiang2024robots,
  title={Robots Pre-Train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets},
  author={Jiang, Guangqi and Sun, Yifei and Huang, Tao and Li, Huanyu and Liang, Yongyuan and Xu, Huazhe},
  journal={arXiv preprint arXiv:2410.22325},
  year={2024}
}