AI Safety Blindspots in Scientific Labs

This research introduces LabSafety Bench, the first comprehensive benchmark for evaluating LLMs' ability to identify and respond to laboratory safety risks.

Tests LLMs across 8 safety domains with 2,100+ questions derived from OSHA standards
Reveals significant performance gaps even in advanced models like GPT-4 and Claude
Demonstrates concerning model tendencies to provide dangerously incomplete safety advice
Highlights urgent need for specialized safety training in scientific AI systems

For security professionals, this research exposes critical vulnerabilities in AI-guided laboratory workflows that could lead to physical harm, highlighting the need for rigorous safety evaluation before deployment in high-stakes scientific environments.

LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs