
LLaRA: Teaching Robots with VLMs
Enhancing robot learning with efficient vision-language models
LLaRA formulates robot action policies as visuo-textual conversations, enabling efficient transfer of pretrained Vision-Language Models (VLMs) to robotics from only a small amount of demonstration data.
- Formulates robot control as a conversation between visual inputs and textual commands (see the sketch after this list)
- Enables more efficient learning from limited robot demonstration data
- Bridges the gap between powerful VLMs and practical robotic applications
- Demonstrates effectiveness in both simulated and real-world tasks
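As a rough illustration of what "robot control as a conversation" can look like, the sketch below recasts a single demonstration step as a chat-style instruction-tuning sample for a VLM. The data fields, prompt wording, and the text encoding of the action (a 2D image-plane point plus a rotation) are assumptions made for illustration, not LLaRA's exact data format.

```python
# Minimal sketch (not the authors' code): turning one robot demonstration
# step into a visuo-textual conversation for VLM instruction tuning.
# Field names and the action-as-text encoding are illustrative assumptions.

from dataclasses import dataclass
from typing import Dict

@dataclass
class DemoStep:
    image_path: str        # camera observation at this timestep
    instruction: str       # natural-language task description
    target_xy: tuple       # action target as normalized image coordinates
    rotation_deg: float    # gripper rotation for this action

def step_to_conversation(step: DemoStep) -> Dict:
    """Convert one demonstration step into a chat-style training sample."""
    user_turn = (
        "<image>\n"
        f"Task: {step.instruction}\n"
        "Where should the robot act next? Answer with a point and a rotation."
    )
    x, y = step.target_xy
    assistant_turn = (
        f"Move to ({x:.3f}, {y:.3f}) with rotation {step.rotation_deg:.1f} degrees."
    )
    return {
        "image": step.image_path,
        "conversations": [
            {"from": "human", "value": user_turn},
            {"from": "gpt", "value": assistant_turn},
        ],
    }

# Example: one pick-and-place step becomes a single conversational sample.
sample = step_to_conversation(
    DemoStep("obs_000.png", "pick up the red block", (0.42, 0.67), 90.0)
)
print(sample["conversations"][1]["value"])
```

Because the action is expressed as plain text in the assistant turn, the same supervised fine-tuning pipeline used for ordinary visual instruction data can, in principle, train the policy; at inference time the model's textual answer is parsed back into a robot command.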
This research advances robot learning by tackling a fundamental challenge: how to leverage powerful vision-language models for physical control systems without requiring massive amounts of specialized robotics data.
Paper: LLaRA: Supercharging Robot Learning Data for Vision-Language Policy