OTTER: Revolutionizing Robotic Decision-Making

Text-Aware Visual Processing for Enhanced Robotic Control

OTTER introduces a novel Vision-Language-Action architecture that helps robots better understand instructions and visual environments for improved task performance.

  • Maintains semantic alignments from pre-trained vision-language models through text-aware visual feature extraction
  • Processes only task-relevant visual elements instead of entire scenes
  • Demonstrates superior performance in robotic instruction following tasks
  • Offers a more efficient approach than existing models that require extensive fine-tuning
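The core idea behind text-aware visual feature extraction can be illustrated with a minimal sketch: instruction token embeddings from a pre-trained vision-language model score each visual patch for relevance, and the pooled visual summary is dominated by patches that match the instruction. This is an illustrative simplification, not OTTER's actual architecture; all function names, shapes, and the pooling scheme here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_aware_pool(patch_feats, text_feats):
    """Pool visual patch features weighted by relevance to the instruction.

    patch_feats: (num_patches, d) visual features from a frozen VLM encoder
    text_feats:  (num_tokens, d)  instruction token features from the same VLM
    Returns a (d,) task-conditioned visual summary: patches aligned with
    the instruction dominate; task-irrelevant patches are down-weighted.
    (Hypothetical sketch -- not the paper's exact mechanism.)
    """
    scores = patch_feats @ text_feats.T     # (num_patches, num_tokens) similarities
    relevance = scores.max(axis=1)          # best-matching token per patch
    weights = softmax(relevance)            # attention weights over patches
    return weights @ patch_feats            # relevance-weighted visual summary

# Toy example with random features standing in for real VLM outputs
rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 64))   # e.g. a 14x14 ViT patch grid
text = rng.standard_normal((8, 64))        # instruction token embeddings
summary = text_aware_pool(patches, text)
print(summary.shape)  # (64,)
```

Because the scoring reuses the pre-trained model's own embedding space, the semantic alignment between language and vision learned during pre-training is preserved rather than overwritten by task-specific fine-tuning.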

This research advances robotic engineering by enabling more intuitive human-robot interfaces and improving robots' ability to follow natural language instructions in complex, real-world environments.

OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction
