
Almost Surely Safe Alignment of Large Language Models at Inference-Time
By Xiaotong Ji, Shyam Sundhar Ramesh, et al.
Abstract:
Even highly capable large language models (LLMs) can produce biased or unsafe responses, and alignment techniques such as RLHF, which aim to mitigate this, are expensive and prone to overfitting because they retrain the LLM. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with probability approaching one, without retraining the underlying model.
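To make the general idea of inference-time alignment concrete, here is a minimal sketch of one common family of such methods: sampling several candidate responses and reranking them with a safety scorer at decode time, so the base LLM itself is never retrained. This is a generic illustration, not the paper's specific algorithm, and the function names (`generate`, `safety_score`, and the toy stand-ins) are hypothetical placeholders.

```python
import random
from typing import Callable, List


def safe_inference_time_decode(
    generate: Callable[[str], str],        # hypothetical: samples one response from the LLM
    safety_score: Callable[[str], float],  # hypothetical: safety classifier score in [0, 1]
    prompt: str,
    n_samples: int = 8,
    threshold: float = 0.9,
) -> str:
    """Sample several candidates, rank them by safety, and return the safest
    one that clears the threshold; the base model is never retrained."""
    candidates: List[str] = [generate(prompt) for _ in range(n_samples)]
    best = max(candidates, key=safety_score)
    if safety_score(best) >= threshold:
        return best
    # Fall back to a refusal if no candidate is judged safe enough.
    return "I can't help with that request."


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_generate(prompt: str) -> str:
    return random.choice(
        ["Here is a helpful, harmless answer.", "Here is a risky answer."]
    )


def toy_safety_score(response: str) -> float:
    return 0.95 if "harmless" in response else 0.1


print(safe_inference_time_decode(toy_generate, toy_safety_score, "How do I stay safe online?"))
```

The design choice being illustrated is the trade-off the abstract highlights: safety is enforced by intervening on the decoding process rather than by updating model weights, which avoids the cost and overfitting risks of retraining.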
Key points:
- Inference-time alignment of large language models, avoiding costly retraining such as RLHF
- Safety application: targets responses that are safe almost surely, i.e., with probability approaching one
Source: Almost Surely Safe Alignment of Large Language Models at Inference-Time