Predicting AI Risks Before They Scale

Forecasting rare but dangerous language model behaviors

A new methodology for identifying security risks in language models that emerge only at deployment scale, before those risks materialize in production.

  • Uses elicitation probability to predict harmful responses that standard testing might miss
  • Enables forecasting risky behaviors across billions of potential queries (see the sketch after this list)
  • Identifies vulnerabilities related to dangerous information sharing and inappropriate advice
  • Helps prevent security breaches that only appear after wide deployment
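
The sketch below illustrates, under simplifying assumptions, how this kind of forecast could work in practice: estimate each query's elicitation probability by repeated sampling, then extrapolate to deployment scale. The `model` and `judge` callables, the sampling budget, and the assumption that deployment queries are independent draws from the same distribution as the evaluation set are illustrative stand-ins, not details taken from the underlying paper.

```python
import numpy as np

def estimate_elicitation_probability(model, judge, query, num_samples=100):
    """Monte Carlo estimate of a query's elicitation probability: the fraction
    of sampled responses that a judge flags as harmful. `model` and `judge`
    are hypothetical callables standing in for a generation API and a
    harmfulness classifier; they are not part of the original summary."""
    flagged = sum(judge(model(query)) for _ in range(num_samples))
    return flagged / num_samples

def forecast_risk_at_scale(elicitation_probs, n_deployment_queries):
    """Extrapolate small-scale evaluation results to deployment scale,
    assuming (idealistically) that deployment queries are independent draws
    from the same distribution as the evaluation queries."""
    p = np.asarray(elicitation_probs, dtype=float)
    mean_p = p.mean()  # average per-query chance of a harmful response
    expected_harmful = mean_p * n_deployment_queries
    # Probability that at least one of n independent queries elicits harm,
    # computed stably as 1 - (1 - mean_p)^n.
    p_any = 1.0 - np.exp(n_deployment_queries * np.log1p(-mean_p))
    return expected_harmful, p_any

# Toy illustration: risks that look negligible over a 1,000-query evaluation
# become near-certain across a billion deployment queries.
rng = np.random.default_rng(0)
toy_probs = rng.beta(0.5, 5000, size=1000)  # synthetic elicitation probabilities
print(forecast_risk_at_scale(toy_probs, n_deployment_queries=1_000_000_000))
```

In this toy example, a mean elicitation probability near 1 in 10,000, small enough for a modest evaluation to dismiss as noise, implies on the order of a hundred thousand harmful responses across a billion queries. That gap is the intuition behind measuring at small scale and forecasting at deployment scale.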

This research matters for AI safety because it allows companies to detect and mitigate rare but severe security risks before public deployment, potentially preventing harmful AI interactions at scale.

Forecasting Rare Language Model Behaviors
