Bridging the Vision Gap

Evaluating How MLLMs See Compared to Humans

This research introduces HVSBench, a benchmark for assessing how closely Multimodal Large Language Models (MLLMs) align with the human visual system in how they perceive visual information.

  • Evaluates MLLMs on fundamental visual perception abilities including attention allocation, visual illusions, and cognitive biases (see the sketch below)
  • Reveals significant gaps between human and MLLM visual perception capabilities
  • Provides insights for developing more human-aligned visual AI systems
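As a rough illustration of this kind of evaluation, the sketch below scores a model by how often its answers to perception questions match the majority human response. The item schema, the model-callable interface, and the alignment metric are illustrative assumptions for this summary, not HVSBench's actual data format or evaluation protocol.

```python
# Minimal sketch of a benchmark-style alignment score (not the official
# HVSBench harness; the item schema and model interface are assumptions).
from dataclasses import dataclass
from typing import Callable

@dataclass
class BenchItem:
    image_path: str        # stimulus image
    question: str          # e.g. "Which object draws attention first?"
    choices: list[str]     # candidate answers
    human_answer: str      # majority human response for this item

# A "model" here is anything mapping (image path, prompt) -> answer text.
MLLM = Callable[[str, str], str]

def human_alignment(model: MLLM, items: list[BenchItem]) -> float:
    """Fraction of items where the model's answer matches the majority
    human answer -- a simple proxy for human-aligned perception."""
    if not items:
        return 0.0
    matches = 0
    for item in items:
        prompt = f"{item.question}\nOptions: {', '.join(item.choices)}"
        answer = model(item.image_path, prompt).strip().lower()
        matches += answer == item.human_answer.strip().lower()
    return matches / len(items)

# Usage with a trivial stand-in model that always picks the same option.
if __name__ == "__main__":
    dummy = lambda image, prompt: "the red ball"
    items = [BenchItem("scene.jpg", "Which object draws attention first?",
                       ["the red ball", "the gray wall"], "the red ball")]
    print(human_alignment(dummy, items))  # -> 1.0
```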

For security applications, understanding these perception gaps is crucial when deploying vision-based AI in monitoring systems, as misalignments could lead to critical oversights or false detections that humans wouldn't make.

Do Multimodal Large Language Models See Like Humans?