MallowsPO: Fine-Tune Your LLM with Preference Dispersions

By Haoxian Chen, Hanyang Zhao, et al.

Abstract:

Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLMs). A weakness of DPO, however, lies in its lack of capability to characterize the diversity of human preferences.
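To make the setup concrete, below is a minimal sketch of a pairwise DPO-style preference loss, with an optional per-prompt dispersion weight standing in for the kind of preference-diversity signal the paper is concerned with. The `dispersion` argument and its scaling role are assumptions for illustration; the exact MallowsPO formulation is given in the paper itself.

```python
# Sketch of a DPO-style preference loss with an optional per-prompt
# dispersion weight. The dispersion scaling shown here is a hypothetical
# stand-in for the paper's dispersion index, not the authors' exact method.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, dispersion=None):
    """Pairwise preference loss on per-sequence log-probabilities.

    dispersion: optional per-prompt tensor; smaller values down-weight
    prompts whose human preferences are assumed to be more dispersed.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_margin - rejected_margin)
    if dispersion is not None:
        logits = dispersion * logits  # scale the preference signal per prompt
    return -F.logsigmoid(logits).mean()

# Example with dummy log-probabilities for a batch of 4 preference pairs.
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b),
                torch.randn(b), torch.randn(b),
                dispersion=torch.full((b,), 0.8))
print(loss)
```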

Key points:

  • Research on fine-tuning large language models with human preference data (RLHF/DPO)
  • Engineering application: preference optimization that accounts for dispersion in human preferences

Source: MallowsPO: Fine-Tune Your LLM with Preference Dispersions
