GLOV: Leveraging LLMs to Enhance Vision Models

Using language models as optimization guides for vision-language systems

GLOV transforms large language models (LLMs) into implicit optimizers that guide vision-language models (VLMs) to achieve better performance and enhanced security.

  • Improves zero-shot classification accuracy with CLIP by generating optimized prompts
  • Reduces attack success rates on state-of-the-art VLMs by up to 60.7%
  • Creates a feedback loop where LLMs continuously refine VLM prompts based on performance
  • Achieves these improvements without additional training or model modification
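The feedback loop above can be sketched as a simple iterative search: an LLM proposes prompt candidates, the VLM scores them, and the ranked history is fed back so the next proposals improve. The sketch below is a toy illustration only; `score_prompt` and `propose_prompts` are hypothetical stand-ins (a real system would score prompts with CLIP zero-shot accuracy and generate candidates by querying an LLM with the ranked history).

```python
import random

# Hypothetical stand-in for CLIP-based scoring: in GLOV this would be
# zero-shot classification accuracy of the prompt template on held-out data.
def score_prompt(prompt: str) -> float:
    # Toy heuristic: reward descriptive, class-focused wording.
    keywords = ("photo", "detailed", "clearly")
    return sum(word in prompt for word in keywords) / len(keywords)

# Hypothetical stand-in for the LLM proposer: given ranked past prompts,
# return new candidate templates. A real system would query an LLM here,
# conditioning it on the (score, prompt) history.
def propose_prompts(ranked_history, n=3):
    templates = [
        "a photo of a {}",
        "a detailed photo of a {}",
        "a clearly visible, detailed photo of a {}",
    ]
    return random.sample(templates, k=min(n, len(templates)))

def glov_style_loop(steps=5):
    history = []  # (score, prompt) pairs fed back to the proposer
    for _ in range(steps):
        ranked = sorted(history, reverse=True)
        for prompt in propose_prompts(ranked):
            history.append((score_prompt(prompt), prompt))
    best_score, best_prompt = max(history)
    return best_prompt

print(glov_style_loop())
```

No model weights are updated at any point; only the text prompt changes, which is why the method needs no additional training.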

This research is significant for security applications: it shows that LLM-guided prompt optimization can harden vision models against adversarial attacks without retraining them, which matters for protecting deployed AI systems in security-critical settings.

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models
