Enhancing Vision Systems with Text-Guided Multimodal Fusion

Leveraging LLMs for RGB-Thermal fusion in challenging conditions

This research introduces a structurally simple yet adaptable multimodal fusion model that uses large language models (LLMs) to guide the fusion of RGB and thermal imagery.

  • Combines RGB and thermal imagery for consistent performance across varying weather and lighting conditions
  • Uses LLMs to extract relevant information from natural language prompts
  • Offers a more adaptable and efficient alternative to traditional, complex fusion modules
  • Particularly valuable for security applications such as surveillance and autonomous driving, which must operate reliably in challenging environmental conditions
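
To make the idea of text-guided fusion concrete, the sketch below blends an RGB feature vector and a thermal feature vector using weights computed from a text embedding. It is a toy illustration only, not the MASTER architecture: the function `text_guided_fusion`, the projection vectors `w_rgb` and `w_thermal`, and the two-dimensional `text_emb` are all hypothetical stand-ins (in a real system the embedding would come from an LLM and the projections would be learned).

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def text_guided_fusion(rgb_feat, thermal_feat, text_emb, w_rgb, w_thermal):
    """Blend two feature vectors with gates derived from a text embedding.

    Assumption: text_emb stands in for an LLM-derived prompt embedding,
    and w_rgb / w_thermal for learned projection vectors.
    """
    # One relevance score per modality, normalized so the gates sum to 1.
    gate_rgb, gate_thermal = softmax(
        [dot(text_emb, w_rgb), dot(text_emb, w_thermal)]
    )
    # Convex combination of the two modalities, element by element.
    fused = [gate_rgb * r + gate_thermal * t
             for r, t in zip(rgb_feat, thermal_feat)]
    return fused, (gate_rgb, gate_thermal)


# Toy inputs: a prompt embedding that aligns with the thermal projection,
# as might be expected for a prompt describing a night-time scene.
text_emb = [0.0, 1.0]
w_rgb = [1.0, 0.0]
w_thermal = [0.0, 2.0]
fused, gates = text_guided_fusion(
    [1.0, 1.0], [3.0, 5.0], text_emb, w_rgb, w_thermal
)
```

With these toy inputs the thermal gate is larger than the RGB gate, so the fused features sit closer to the thermal ones. The point of the sketch is only that a prompt can shift the balance between modalities without a dedicated, hand-designed fusion module.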

MASTER: Multimodal Segmentation with Text Prompts
