Teaching Robots Through Video Observation

This research introduces Moto, an approach that lets robots learn manipulation skills from human demonstration videos that carry no explicit action labels, using latent motion tokens as the shared representation between video and robot control.

Creates a unified representation of motion that works across different embodiments
Leverages abundant video data instead of expensive labeled demonstrations
Achieves zero-shot transfer of skills from human videos to robot actions
Demonstrates improved performance on manipulation tasks with minimal training
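
To make the idea of an embodiment-independent motion representation concrete, here is a toy sketch in Python. It is not Moto's tokenizer, which is a learned model whose details are not given in this summary. The sketch assumes each video frame has already been reduced to a small feature vector, and it assigns every frame-to-frame change to the nearest entry of a hand-written codebook. The names frame_delta, nearest_code, tokenize_motion and CODEBOOK, along with all the numbers, are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def frame_delta(prev: Vector, curr: Vector) -> List[float]:
    """Change between two consecutive frame feature vectors."""
    return [c - p for p, c in zip(prev, curr)]


def nearest_code(delta: Vector, codebook: Sequence[Vector]) -> int:
    """Index of the codebook entry closest to delta (squared Euclidean distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(delta, codebook[i]))


def tokenize_motion(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map a frame sequence to one discrete motion token per frame transition."""
    return [
        nearest_code(frame_delta(a, b), codebook)
        for a, b in zip(frames, frames[1:])
    ]


# Hypothetical 2-D codebook: 0 = stay still, 1 = move right, 2 = move up.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# Made-up feature tracks: a hand and a gripper start at different positions
# but perform the same sequence of motions (right, up, hold).
human_video = [(0.0, 0.0), (0.9, 0.1), (1.0, 1.1), (1.0, 1.1)]
robot_video = [(5.0, 5.0), (6.1, 5.0), (6.0, 5.9), (6.0, 6.0)]

human_tokens = tokenize_motion(human_video, CODEBOOK)
robot_tokens = tokenize_motion(robot_video, CODEBOOK)

print(human_tokens)  # [1, 2, 0]
print(robot_tokens)  # [1, 2, 0]
```

Because the tokens encode the change between frames rather than absolute positions, the human track and the robot track map to the same token sequence even though they start in different places. That is the sense in which a motion-level representation can be shared across embodiments in this toy setting.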

For manufacturing automation, this approach could let robots learn complex assembly tasks more efficiently and reduce the programming burden in industrial applications.

Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos