
MallowsPO: Fine-Tune Your LLM with Preference Dispersions
By Haoxian Chen, Hanyang Zhao...
Abstract:
Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLMs). A weakness of DPO, however, lies in its lack of capability to characterize the diversity of human preferences.
Key points:
- Research on preference optimization for fine-tuning large language models
- Engineering application: improving RLHF-based fine-tuning by modeling the dispersion of human preferences
Source: MallowsPO: Fine-Tune Your LLM with Preference Dispersions
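
The abstract builds on the standard DPO objective, which scores each preference pair by the gap in policy-versus-reference log-likelihood ratios between the chosen and rejected responses. Below is a minimal PyTorch sketch of that objective. The `dispersion` argument is a hypothetical per-prompt weight added only to illustrate where a dispersion index could enter the loss; it is not the paper's actual MallowsPO formulation, and the function name and signature are assumptions for this sketch.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected,
             beta=0.1, dispersion=None):
    """Standard DPO loss over a batch of preference pairs.

    `dispersion` is a hypothetical per-prompt weight, included only to
    show where a dispersion index could modulate the training signal;
    it is NOT the paper's exact MallowsPO objective.
    """
    # Implicit reward margins: log-ratios of policy to reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = chosen_logratio - rejected_logratio
    if dispersion is not None:
        # Hypothetical: prompts with highly dispersed (noisy) preferences
        # get a smaller weight, softening their gradient contribution.
        margin = dispersion * margin
    # Negative log-likelihood under the Bradley-Terry preference model.
    return -F.logsigmoid(beta * margin).mean()

# Usage with dummy per-pair sequence log-probabilities:
batch = 4
policy_c = torch.randn(batch, requires_grad=True)
policy_r = torch.randn(batch, requires_grad=True)
ref_c, ref_r = torch.randn(batch), torch.randn(batch)
loss = dpo_loss(policy_c, policy_r, ref_c, ref_r,
                dispersion=torch.full((batch,), 0.8))
loss.backward()
```

With `dispersion=None` this reduces to plain DPO, which mirrors the abstract's claim that existing DPO models are special cases under a fixed dispersion index.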