🧸Robots Pre-Train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets

Guangqi Jiang*1    Yifei Sun*2    Tao Huang*3    Huanyu Li3   
Yongyuan Liang4    Huazhe Xu5   
1UC San Diego          2Tongji University          3Shanghai Jiao Tong University
4University of Maryland          5Tsinghua University
* Equal contribution. † Equal advising.

Manipulation Centricity measures how well pre-trained visual representations correlate with downstream manipulation tasks, serving as a strong predictor of task success rates. Building on this insight, Manipulation Centric Representation (MCR) enhances manipulation centricity by pre-training visual encoders with large-scale robotic data.

Abstract

The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representations. Despite their promising results, representations from human videos are inevitably subject to distribution shifts and lack the dynamics information crucial for task completion. We first evaluate various pre-trained representations in terms of their correlation with downstream robotic manipulation tasks (i.e., manipulation centricity). Interestingly, we find that manipulation centricity is a strong indicator of success rate when applied to downstream tasks. Drawing on these findings, we propose Manipulation Centric Representation (MCR), a foundation representation learning framework that captures both visual features and the dynamics information of manipulation tasks (such as actions and proprioceptive states) to improve manipulation centricity. Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions. We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with an action prediction loss and a time contrastive loss during pre-training. Empirical results across four simulation domains with 20 robotic manipulation tasks demonstrate that MCR outperforms the strongest baseline by 14.8%. Moreover, MCR boosts the success rate of three real-world manipulation tasks by 76.9%.
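The abstract lists three pre-training objectives: a contrastive loss aligning visual features with state-action dynamics, an action prediction loss, and a time contrastive loss. Below is a minimal PyTorch sketch of how such objectives could be combined; the module names (visual_encoder, dyn_proj, action_head), the InfoNCE formulation, the temperature, and the loss weights are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    # Symmetric InfoNCE: matching (anchor_i, positive_i) pairs are positives,
    # every other pair in the batch serves as a negative.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature              # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def mcr_pretrain_loss(visual_encoder, dyn_proj, action_head,
                      obs_t, obs_tk, state_t, action_t,
                      w_dyn=1.0, w_act=1.0, w_time=1.0):
    # (1) align visual features with proprioceptive state-action dynamics,
    # (2) predict the robot action from the visual feature,
    # (3) pull temporally close frames together (time contrastive).
    z_t = visual_encoder(obs_t)                               # frame at time t
    z_tk = visual_encoder(obs_tk)                             # nearby frame t+k
    dyn = dyn_proj(torch.cat([state_t, action_t], dim=-1))    # state-action embedding
    loss_dyn = info_nce(z_t, dyn)                             # dynamics alignment
    loss_act = F.mse_loss(action_head(z_t), action_t)         # action prediction
    loss_time = info_nce(z_t, z_tk)                           # time contrastive
    return w_dyn * loss_dyn + w_act * loss_act + w_time * loss_time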

Manipulation Centricity


By analyzing feature similarities between Grad-CAM visualizations and SAM2-identified ground-truth regions, Manipulation Centricity quantifies how well a representation focuses on task-relevant areas, and thereby predicts downstream performance.
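As a rough illustration, manipulation centricity can be scored per frame as the similarity between the Grad-CAM saliency map and the task-relevant region mask. The sketch below assumes a cosine similarity between the rescaled heatmap and a binary SAM2-style mask; the exact measure used in the paper may differ.

import numpy as np

def manipulation_centricity(grad_cam_map: np.ndarray, gt_region_mask: np.ndarray) -> float:
    # Overlap between a Grad-CAM saliency map and a ground-truth task-relevant
    # region (e.g., the manipulated object and end-effector), via cosine similarity.
    cam = grad_cam_map.astype(np.float64).ravel()
    mask = gt_region_mask.astype(np.float64).ravel()
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # rescale to [0, 1]
    denom = np.linalg.norm(cam) * np.linalg.norm(mask) + 1e-8
    return float(cam @ mask / denom)

A representation with a higher average score across frames is expected to achieve a higher downstream success rate.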

Real-world Manipulation Benchmark

Lift up (2x speed)

Sweep (2x speed)

Rearrange (2x speed)

MCR consistently outperforms baselines across all real-world tasks.
Grad-CAM visualization on Rearrange: MCR shows the best manipulation centricity.

Simulation Benchmark





Grad-CAM visualization for the Square task from Robomimic and the Pick Place Wall task from MetaWorld.



4 Domains: MetaWorld, DexArt, Robomimic, RoboCasa; 20 Tasks

Findings on robotic datasets

Larger datasets lead to better performance.
Greater benefits for tasks with a smaller embodiment gap.

Feature Analysis


We perform t-SNE visualization on 10 simulation tasks from MetaWorld and 3 real-robot tasks. Each dot represents an image frame, and each color indicates a task. The results demonstrate that (1) our representation produces the best-separated task clusters, and (2) robot data is helpful for learning robotic representations.
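For reference, such a plot can be produced by projecting per-frame encoder features to 2D with t-SNE and coloring points by task. The function below is a generic scikit-learn sketch; the perplexity and other hyperparameters are illustrative, not the paper's settings.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, task_ids: np.ndarray, out_path: str = "tsne.png"):
    # features: (N, D) per-frame embeddings; task_ids: (N,) integer task labels.
    coords = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=task_ids, cmap="tab20", s=5)
    plt.axis("off")
    plt.savefig(out_path, dpi=200, bbox_inches="tight")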

BibTeX

If you find the project helpful for your research, please consider citing our paper:
@article{jiang2024robots,
  title={Robots Pre-Train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets},
  author={Jiang, Guangqi and Sun, Yifei and Huang, Tao and Li, Huanyu and Liang, Yongyuan and Xu, Huazhe},
  journal={arXiv preprint arXiv:2410.22325},
  year={2024}
}