
Enhancing Vision Systems with Text-Guided Multimodal Fusion
Leveraging LLMs for RGB-Thermal fusion in challenging conditions
This research introduces a structurally simple yet adaptable multimodal fusion model that leverages large language models to combine RGB and thermal imaging for enhanced vision systems.
- Combines RGB imagery with thermal imaging to maintain consistent performance across variable weather and lighting conditions
- Uses LLMs to extract guidance from natural language prompts and steer the fusion process
- Offers a more adaptable and efficient alternative to traditional, hand-designed fusion modules
- Particularly valuable for safety- and security-critical applications, such as surveillance and autonomous driving, that must operate reliably in challenging environmental conditions
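The idea of text-conditioned fusion can be sketched minimally as follows. This is not the paper's actual architecture; it is a toy illustration, assuming a text embedding (which in practice would come from an LLM) is mapped by a small linear head to per-modality weights that blend the RGB and thermal feature maps. All dimensions and variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes: 16-dim text embedding, 8-channel 4x4 feature maps.
TEXT_DIM, C, H, W = 16, 8, 4, 4

# Stand-in text embedding; in a real system this would be produced
# by an LLM from a natural language prompt (e.g. "night, heavy fog").
text_emb = rng.normal(size=TEXT_DIM)

# Stand-in feature maps from the RGB and thermal encoder branches.
rgb_feat = rng.normal(size=(C, H, W))
thermal_feat = rng.normal(size=(C, H, W))

# A small (hypothetical) linear gating head: text embedding -> one
# logit per modality, normalized so the weights sum to 1.
W_gate = rng.normal(size=(2, TEXT_DIM)) * 0.1
gates = softmax(W_gate @ text_emb)

# Text-conditioned fusion: a convex combination of the two modalities,
# so the prompt can shift emphasis toward RGB or thermal features.
fused = gates[0] * rgb_feat + gates[1] * thermal_feat
```

Under this scheme, a prompt describing low-light conditions could push the gate toward the thermal branch, while a daytime prompt would favor RGB, without changing any of the fusion code itself.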