Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation



We distill features from 2D foundation models into 3D feature fields, and enable few-shot language-guided manipulation that generalizes across object poses, shapes, appearances and categories.

We designate novel objects to grasp using open-ended language queries, and achieve this using only ten demonstrations across four object categories.

Abstract

Self-supervised and language-supervised image models contain rich knowledge of the world that is important for generalization. Many robotic tasks, however, require a detailed understanding of 3D geometry, which is often lacking in 2D image features. This work bridges this 2D-to-3D gap for robotic manipulation by leveraging distilled feature fields to combine accurate 3D geometry with rich semantics from 2D foundation models. We present a few-shot learning method for 6-DOF grasping and placing that harnesses these strong spatial and semantic priors to achieve in-the-wild generalization to unseen objects. Using features distilled from a vision-language model, CLIP, we present a way to designate novel objects for manipulation via free-text natural language, and demonstrate its ability to generalize to unseen expressions and novel categories of objects.
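To make the 2D-to-3D distillation concrete, here is a minimal sketch (not the authors' exact implementation): a feature head is attached to the NeRF, per-point features are alpha-composited along each ray with the same volume-rendering weights used for color, and the rendered features are supervised with 2D feature maps (e.g., CLIP patch features) at the same pixels. Module and function names below are hypothetical.

```python
import torch
import torch.nn as nn


class FeatureHead(nn.Module):
    """Hypothetical head mapping the NeRF's per-point latent to a feature
    vector matching the 2D foundation model's embedding dimension."""

    def __init__(self, latent_dim: int, feature_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.mlp(latent)


def render_features(weights: torch.Tensor,
                    point_features: torch.Tensor) -> torch.Tensor:
    """Alpha-composite per-sample features along each ray, reusing the
    volume-rendering weights the NeRF already computes for color.
    weights: (rays, samples); point_features: (rays, samples, D)."""
    return (weights.unsqueeze(-1) * point_features).sum(dim=1)


def distillation_loss(rendered: torch.Tensor,
                      target: torch.Tensor) -> torch.Tensor:
    """L2 loss against the 2D foundation-model features at the same pixels;
    geometry itself is still learned from the usual photometric loss."""
    return ((rendered - target) ** 2).mean()
```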


Robot Results

Language-Guided Manipulation

We present results on an example scene set up with novel objects in 6-DOF poses. We provide the robot with just 10 demonstrations across 4 object categories (mugs, screwdrivers, a caterpillar toy, drying racks), each in the form of a 6-DOF grasp or place pose. We demonstrate the ability to designate novel objects to manipulate via language queries that span color and material properties, as well as unseen object categories.
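As a rough illustration of how a free-text query can select a target in the distilled field (a sketch, not the paper's exact pipeline): embed the query with CLIP's text encoder and score candidate 3D points by cosine similarity against their distilled features. The `point_features` argument below stands in for a hypothetical interface to the feature field.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)


@torch.no_grad()
def localize_query(query: str, point_features: torch.Tensor) -> torch.Tensor:
    """Score N candidate 3D points against a free-text query.

    point_features: (N, D) distilled CLIP features queried from the feature
    field at the candidate points (hypothetical interface).
    Returns (N,) cosine-similarity scores.
    """
    tokens = clip.tokenize([query]).to(device)
    text_emb = model.encode_text(tokens).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    feats = point_features / point_features.norm(dim=-1, keepdim=True)
    return (feats @ text_emb.T).squeeze(-1)
```

The highest-scoring region can then seed the few-shot grasp-pose optimization described below.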

Few-Shot Grasping Results

We provide the robot with just two demonstrations for each task, such as grasping a mug by its lip, and show generalization across object poses, shapes, sizes and appearances. Our approach makes no assumptions about objectness and requires no segmentation masks, since we optimize for a gripper pose over the entire scene.
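A minimal sketch of what optimizing a gripper pose over the scene could look like, assuming a `feature_field` callable that returns distilled features at 3D points and a task embedding averaged from features at the demonstrated gripper poses; all names are illustrative, not the paper's API.

```python
import torch


def pose_objective(pose: torch.Tensor,
                   local_points: torch.Tensor,
                   task_embedding: torch.Tensor,
                   feature_field) -> torch.Tensor:
    """Negative similarity between the task embedding (unit-normalized mean of
    demo features at gripper query points) and features sampled at the same
    points under a candidate 6-DOF pose.

    pose: (4, 4) homogeneous transform of the gripper.
    local_points: (N, 3) query points in the gripper frame.
    feature_field: callable mapping (N, 3) world points to (N, D) features.
    """
    R, t = pose[:3, :3], pose[:3, 3]
    world_points = local_points @ R.T + t           # (N, 3)
    feats = feature_field(world_points)             # (N, D)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    sim = (feats * task_embedding).sum(-1).mean()   # mean cosine similarity
    return -sim  # lower is better for a minimizer
```

Since such an objective is highly non-convex, one would typically evaluate it over a coarse set of candidate poses and refine the best candidates by gradient descent.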

Grasp Mug Lip (2 demos: a grey mug and a red mug)

Citation

@inproceedings{shen2023F3RM,
    title={Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation},
    author={Shen, William and Yang, Ge and Yu, Alan and Wong, Jansen and Kaelbling, Leslie Pack and Isola, Phillip},
    booktitle={7th Annual Conference on Robot Learning},
    year={2023}
}