FALCON: Revolutionizing Visual Processing in MLLMs

FALCON is an approach to handling high-resolution images in multimodal large language models (MLLMs). It targets two problems: visual redundancy and visual fragmentation.

Key Innovations:

Introduces a visual register technique to eliminate redundant visual tokens
Employs Register-based Representation Compacting (ReCompact) for efficient processing
Implements Register Interactive Attention (ReAtten) to enhance visual reasoning
Reports superior performance while reducing computational overhead
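
To make the register idea in the list above more concrete, here is a minimal sketch of one generic way a small set of register tokens could compact a longer sequence of patch tokens. It is an illustration under stated assumptions, not FALCON's actual implementation: the function name `compact_with_registers`, the token counts, the single attention step, and the absence of learned projections are all hypothetical choices made for brevity.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def compact_with_registers(patch_tokens, registers):
    """Hypothetical helper (assumption, not the paper's code).

    Each register attends over the joint sequence [patches; registers],
    and only the register outputs are kept as the compact representation.
    """
    seq = patch_tokens + registers
    dim = len(seq[0])
    outputs = []
    for query in registers:
        # Scaled dot-product attention weights of this register over all tokens.
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in seq])
        # Weighted sum of all tokens becomes the register's output vector.
        outputs.append(
            [sum(w * tok[j] for w, tok in zip(weights, seq)) for j in range(dim)]
        )
    return outputs


random.seed(0)
DIM = 8
# Illustrative sizes only: 16 patch tokens compacted into 4 register tokens.
patch_tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]
registers = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(4)]

compact = compact_with_registers(patch_tokens, registers)
print(len(patch_tokens), "patch tokens ->", len(compact), "register tokens")
```

In this sketch the downstream language model would receive only the 4 register outputs instead of the 16 patch tokens, which is the sense in which registers can reduce redundancy. A real system would use learned registers and multiple transformer layers.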

By cutting the number of visual tokens the model must process, this approach could make MLLMs more practical to deploy in real-world applications that require detailed analysis of high-resolution images.

FALCON: Resolving Visual Redundancy and Fragmentation in High-resolution Multimodal Large Language Models via Visual Registers