Predicting AI Risks Before They Scale

Forecasting rare but dangerous language model behaviors

A new methodology for identifying security risks in language models that emerge only at deployment scale, before those risks materialize in production.

  • Uses elicitation probability to predict harmful responses that standard testing might miss
  • Enables forecasting risky behaviors across billions of potential queries (see the sketch after this list)
  • Identifies vulnerabilities related to dangerous information sharing and inappropriate advice
  • Helps prevent security breaches that only appear after wide deployment
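
The sketch below illustrates, under simplifying assumptions, how this kind of forecast could work in practice: estimate each query's elicitation probability by repeated sampling, then extrapolate to deployment scale. The `model` and `judge` callables, the sampling budget, and the assumption that deployment queries are independent draws from the same distribution as the evaluation set are illustrative stand-ins, not details taken from the underlying paper.

```python
import numpy as np

def estimate_elicitation_probability(model, judge, query, num_samples=100):
    """Monte Carlo estimate of a query's elicitation probability: the fraction
    of sampled responses that a judge flags as harmful. `model` and `judge`
    are hypothetical callables standing in for a generation API and a
    harmfulness classifier; they are not part of the original summary."""
    flagged = sum(judge(model(query)) for _ in range(num_samples))
    return flagged / num_samples

def forecast_risk_at_scale(elicitation_probs, n_deployment_queries):
    """Extrapolate small-scale evaluation results to deployment scale,
    assuming (idealistically) that deployment queries are independent draws
    from the same distribution as the evaluation queries."""
    p = np.asarray(elicitation_probs, dtype=float)
    mean_p = p.mean()  # average per-query chance of a harmful response
    expected_harmful = mean_p * n_deployment_queries
    # Probability that at least one of n independent queries elicits harm,
    # computed stably as 1 - (1 - mean_p)^n.
    p_any = 1.0 - np.exp(n_deployment_queries * np.log1p(-mean_p))
    return expected_harmful, p_any

# Toy illustration: risks that look negligible over a 1,000-query evaluation
# become near-certain across a billion deployment queries.
rng = np.random.default_rng(0)
toy_probs = rng.beta(0.5, 5000, size=1000)  # synthetic elicitation probabilities
print(forecast_risk_at_scale(toy_probs, n_deployment_queries=1_000_000_000))
```

In this toy example, a mean elicitation probability near 1 in 10,000, small enough for a modest evaluation to dismiss as noise, implies on the order of a hundred thousand harmful responses across a billion queries. That gap is the intuition behind measuring at small scale and forecasting at deployment scale.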

This research matters for AI safety because it allows companies to detect and mitigate rare but severe security risks before public deployment, potentially preventing harmful AI interactions at scale.

Forecasting Rare Language Model Behaviors
