
Almost Surely Safe Alignment of Large Language Models at Inference-Time
By Xiaotong Ji, Shyam Sundhar Ramesh, et al.
Abstract:
Even highly capable large language models (LLMs) can produce biased or unsafe responses, and alignment techniques such as RLHF, which aim to mitigate this, are expensive and prone to overfitting because they retrain the LLM. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with probability approaching one, without retraining the underlying model.
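To make the general idea of inference-time alignment concrete, here is a minimal sketch of one common family of such methods: sampling several candidate responses and reranking them with a safety scorer at decode time, so the base LLM itself is never retrained. This is a generic illustration, not the paper's specific algorithm, and the function names (`generate`, `safety_score`, and the toy stand-ins) are hypothetical placeholders.

```python
import random
from typing import Callable, List


def safe_inference_time_decode(
    generate: Callable[[str], str],        # hypothetical: samples one response from the LLM
    safety_score: Callable[[str], float],  # hypothetical: safety classifier score in [0, 1]
    prompt: str,
    n_samples: int = 8,
    threshold: float = 0.9,
) -> str:
    """Sample several candidates, rank them by safety, and return the safest
    one that clears the threshold; the base model is never retrained."""
    candidates: List[str] = [generate(prompt) for _ in range(n_samples)]
    best = max(candidates, key=safety_score)
    if safety_score(best) >= threshold:
        return best
    # Fall back to a refusal if no candidate is judged safe enough.
    return "I can't help with that request."


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_generate(prompt: str) -> str:
    return random.choice(
        ["Here is a helpful, harmless answer.", "Here is a risky answer."]
    )


def toy_safety_score(response: str) -> float:
    return 0.95 if "harmless" in response else 0.1


print(safe_inference_time_decode(toy_generate, toy_safety_score, "How do I stay safe online?"))
```

The design choice being illustrated is the trade-off the abstract highlights: safety is enforced by intervening on the decoding process rather than by updating model weights, which avoids the cost and overfitting risks of retraining.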
Key points:
- Inference-time alignment of large language models, avoiding costly retraining such as RLHF
- Safety application: targets responses that are safe almost surely, i.e., with probability approaching one
Source: Almost Surely Safe Alignment of Large Language Models at Inference-Time