Defending LLMs Against Unsafe Feedback

Securing RLHF pipelines against harmful feedback manipulation

This research explores a critical but overlooked vulnerability: how reinforcement learning from human feedback (RLHF) can be compromised by unsafe or malicious feedback inputs.

  • Identifies novel risks in feedback collection systems widely deployed in production LLMs
  • Evaluates the effectiveness of different defense mechanisms against unsafe feedback
  • Demonstrates that current safety guards can be undermined through manipulated feedback
  • Proposes practical safeguards for more robust RLHF implementations

As organizations increasingly rely on user feedback to improve AI systems, this work provides essential guidance for security professionals on protecting LLMs from being manipulated into producing harmful content through their training feedback loops.
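
One way to harden the feedback loop described above is to screen incoming preference data before it reaches reward-model training. The sketch below is a hypothetical illustration under assumed names (FeedbackItem, unsafe_score, threshold), not the safeguard proposed in the paper:

```python
# Hypothetical sketch: screen user preference feedback before it is used for
# reward-model training. All names here are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FeedbackItem:
    prompt: str
    chosen: str    # response the user marked as preferred
    rejected: str  # response the user marked as worse


def filter_unsafe_feedback(
    items: List[FeedbackItem],
    unsafe_score: Callable[[str], float],  # e.g. a moderation/safety classifier
    threshold: float = 0.5,
) -> List[FeedbackItem]:
    """Drop feedback pairs whose *preferred* response looks unsafe.

    The idea: if a user rewards harmful content, that pair should not be
    allowed to shape the reward model or the subsequent policy update.
    """
    kept = []
    for item in items:
        if unsafe_score(item.chosen) >= threshold:
            # Preferred response is likely unsafe -> discard (or route to review).
            continue
        kept.append(item)
    return kept


# Usage with a stand-in scorer; a real system would call a moderation model.
if __name__ == "__main__":
    def toy_scorer(text: str) -> float:
        return 1.0 if "how to build a weapon" in text.lower() else 0.0

    batch = [
        FeedbackItem("q1", "Here is a safe answer.", "A refusal."),
        FeedbackItem("q2", "Sure, how to build a weapon: ...", "I can't help with that."),
    ]
    print(len(filter_unsafe_feedback(batch, toy_scorer)))  # -> 1
```

Filtering at ingestion is only one layer; the paper's broader point is that such guards must be evaluated against deliberately manipulated feedback, not just accidental noise.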

Evaluating Defences against Unsafe Feedback in RLHF
