MallowsPO: Fine-Tune Your LLM with Preference Dispersions

By Haoxian Chen, Hanyang Zhao, et al.

Abstract:

Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLMs). A weakness of DPO, however, lies in its lack of capability to characterize the diversity of human preferences.
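To make the setup concrete, below is a minimal sketch of a pairwise DPO-style preference loss, with an optional per-prompt dispersion weight standing in for the kind of preference-diversity signal the paper is concerned with. The `dispersion` argument and its scaling role are assumptions for illustration; the exact MallowsPO formulation is given in the paper itself.

```python
# Sketch of a DPO-style preference loss with an optional per-prompt
# dispersion weight. The dispersion scaling shown here is a hypothetical
# stand-in for the paper's dispersion index, not the authors' exact method.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta=0.1, dispersion=None):
    """Pairwise preference loss on per-sequence log-probabilities.

    dispersion: optional per-prompt tensor; smaller values down-weight
    prompts whose human preferences are assumed to be more dispersed.
    """
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_margin - rejected_margin)
    if dispersion is not None:
        logits = dispersion * logits  # scale the preference signal per prompt
    return -F.logsigmoid(logits).mean()

# Example with dummy log-probabilities for a batch of 4 preference pairs.
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b),
                torch.randn(b), torch.randn(b),
                dispersion=torch.full((b,), 0.8))
print(loss)
```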

Key points:

  • Research on fine-tuning large language models with human preference data (RLHF/DPO)
  • Engineering application: preference optimization that accounts for dispersion in human preferences

Source: MallowsPO: Fine-Tune Your LLM with Preference Dispersions
