Connecting Language to Actions & the World @ CMU




Open-Vocabulary Mobile Manipulation

Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment and placing it in a commanded location. The benchmark includes both simulation environments and a parallel stack for robot control.
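
As a rough sketch of the task structure, one episode of OVMM can be thought of as a pick-and-place loop driven by an open-vocabulary command. The `OvmmCommand` class and the `env`/`agent` interfaces below are hypothetical stand-ins for illustration, not the benchmark's actual API:

```python
# Sketch of one OVMM episode. Every name here (OvmmCommand, env.reset,
# env.step, agent.act, env.object_at) is an illustrative assumption,
# not the benchmark's real interface.
from dataclasses import dataclass

@dataclass
class OvmmCommand:
    target_object: str   # e.g. "mug" -- any object, named in open vocabulary
    goal_location: str   # e.g. "kitchen table" -- the commanded placement

def run_episode(env, agent, command: OvmmCommand) -> bool:
    """Roll out one pick-and-place episode in an unseen environment."""
    obs = env.reset(command)                 # new scene, language-specified goal
    done = False
    while not done:
        action = agent.act(obs, command)     # navigate, pick, or place
        obs, done = env.step(action)
    # Success = the named object ended up at the commanded location.
    return env.object_at(command.target_object, command.goal_location)
```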

WebQA Benchmark

WebQA is a new benchmark for multimodal, multihop reasoning in which systems are presented with the same style of data humans encounter when searching the web: snippets and images. The system must identify which information is relevant across modalities and combine it with reasoning to answer the query. Systems are evaluated on both the correctness of their answers and the sources they cite.
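
A minimal sketch of what an example and its dual evaluation might look like is below. The field names and the exact-match/F1 scoring are illustrative assumptions, not WebQA's actual schema or metrics:

```python
# Hypothetical shape of a WebQA example plus a scoring stub. Field names
# and metrics are assumptions for illustration, not the benchmark's schema.
from dataclasses import dataclass, field

@dataclass
class Source:
    kind: str          # "snippet" or "image"
    content: str       # text snippet, or an image reference with caption
    is_relevant: bool  # gold label: does this source support the answer?

@dataclass
class WebQAExample:
    question: str
    answer: str
    sources: list[Source] = field(default_factory=list)

def evaluate(pred_answer: str, pred_source_ids: set[int], ex: WebQAExample):
    """Score both the answer and the retrieved sources (sketch)."""
    gold_ids = {i for i, s in enumerate(ex.sources) if s.is_relevant}
    answer_ok = pred_answer.strip().lower() == ex.answer.strip().lower()
    # Set F1 over source indices: 2|P∩G| / (|P| + |G|).
    source_f1 = (2 * len(pred_source_ids & gold_ids)
                 / max(1, len(pred_source_ids) + len(gold_ids)))
    return answer_ok, source_f1
```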

VLC Checkpoints

Checkpoints for "Training Vision-Language Transformers from Captions Alone": an MAE-based vision-language transformer that does not rely on supervised class labels.
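
A minimal sketch of inspecting a released checkpoint with PyTorch, assuming a standard serialized state dict; the file name and any nesting under a "state_dict" key are assumptions for illustration:

```python
# Sketch: load and inspect a VLC checkpoint. The file name and state-dict
# layout are assumptions, not the release's documented format.
import torch

ckpt = torch.load("vlc_checkpoint.pt", map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)   # some checkpoints nest weights
print(f"{len(state_dict)} tensors; first key: {next(iter(state_dict))}")
# model.load_state_dict(state_dict)  # with the matching VLC architecture
```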

ALFRED & FILM Code

ALFRED (Action Learning From Realistic Environments and Directives) is a new benchmark for learning a mapping from natural-language instructions and egocentric vision to sequences of actions for household tasks. Long compositional rollouts with non-reversible state changes are among the phenomena we include to shrink the gap between research benchmarks and real-world applications.
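
As a sketch of that mapping, an ALFRED-style rollout consumes a language goal and egocentric frames and emits a discrete action sequence. The action list and the `env`/`agent` interfaces below are illustrative assumptions, not the benchmark's actual API:

```python
# Sketch of an instruction-following rollout. The environment and agent
# interfaces are hypothetical; the action names are illustrative.
from typing import Iterator

ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "Pickup", "Put",
           "Open", "Close", "ToggleOn", "ToggleOff", "Slice", "Stop"]

def rollout(agent, env, goal: str, step_limit: int = 1000) -> Iterator[str]:
    """Yield actions until the agent stops. State changes may be
    non-reversible (e.g. slicing an object), so there is no undo
    within an episode."""
    frame = env.reset(goal)                # egocentric RGB observation
    for _ in range(step_limit):
        action = agent.act(frame, goal)    # condition on vision + language
        yield action
        if action == "Stop":
            break
        frame = env.step(action)
```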