ARCON: Next-Generation Video Prediction

ARCON is a video continuation method for Large Vision Models (LVMs) that alternates between generating semantic tokens and RGB tokens, with the aim of producing more consistent and accurate predictions.

Implements an alternating semantic-RGB token generation scheme to improve structural consistency
Enhances video prediction accuracy through specialized optical flow-based texture stitching
Demonstrates particular effectiveness in driving scenarios where accurate prediction is safety-critical
Applies auto-regressive token generation, the approach used by large language models, to visual prediction tasks
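
The alternating scheme in the first point can be illustrated with a minimal sketch. It is not the paper's implementation: the names continue_video, predict_next, sem_len and rgb_len are hypothetical, the toy predictor stands in for a trained LVM, and the ordering (each frame's semantic tokens generated before its RGB tokens) is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

Token = int
# One generated frame: (semantic tokens, RGB tokens). Hypothetical layout.
Frame = Tuple[List[Token], List[Token]]


def continue_video(
    context: List[Token],
    predict_next: Callable[[List[Token]], Token],
    num_frames: int,
    sem_len: int,
    rgb_len: int,
) -> List[Frame]:
    """Auto-regressively extend a token sequence, alternating per frame
    between a block of semantic tokens and a block of RGB tokens."""
    sequence = list(context)  # running history the predictor conditions on
    frames: List[Frame] = []
    for _ in range(num_frames):
        # Assumption: structure (semantic tokens) comes first for each frame.
        semantic: List[Token] = []
        for _ in range(sem_len):
            tok = predict_next(sequence)
            sequence.append(tok)
            semantic.append(tok)
        # RGB tokens are then conditioned on the history, which now
        # includes this frame's semantic tokens.
        rgb: List[Token] = []
        for _ in range(rgb_len):
            tok = predict_next(sequence)
            sequence.append(tok)
            rgb.append(tok)
        frames.append((semantic, rgb))
    return frames


def toy_predictor(sequence: List[Token]) -> Token:
    """Deterministic stand-in for a trained model (illustration only)."""
    return (sum(sequence) + len(sequence)) % 7


frames = continue_video([1, 2, 3], toy_predictor, num_frames=2, sem_len=2, rgb_len=3)
```

Because every RGB token is predicted from a history that already contains the frame's semantic tokens, the appearance tokens are conditioned on the predicted structure, which is the intuition behind the claimed gain in structural consistency.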

This research benefits autonomous vehicles and simulation systems by improving how AI models predict future frames in dynamic environments, a critical component of both safe autonomous driving and realistic driving simulators.

Original Paper: ARCON: Advancing Auto-Regressive Continuation for Driving Videos