Self-Rewarding Language Models (SRLMs) use an LLM's own judgments as the feedback signal during training; because that signal is not capped by human raters, the approach could in principle open a path to superhuman performance.
Areas of application
SRLMs introduce an approach to language model training in which the LLM acts as its own judge, assigning rewards to its own outputs via LLM-as-a-Judge prompting.
This removes the need for a separate, frozen reward model, which is typically bottlenecked by the quality of human preference data and cannot improve alongside the LLM it is training.
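To make the self-judging step concrete, here is a minimal sketch of how a model can grade its own response against a 0-5 rubric and have the score parsed out of the judgement text. The prompt wording and the `generate` callable are illustrative assumptions, not the paper's exact template or API.

```python
import re
from typing import Callable

# Illustrative rubric-style judge prompt; the wording is an assumption,
# not the exact template from the Self-Rewarding Language Models paper.
JUDGE_TEMPLATE = """Review the user's question and the corresponding response.
Award points additively (0-5) for relevance, coverage, helpfulness,
clarity, and expert quality. Conclude with the line "Score: <total points>".

Question: {prompt}
Response: {response}
"""


def self_reward(prompt: str, response: str, generate: Callable[[str], str]) -> float:
    """Score a response by asking the same model to judge it.

    `generate` is any text-completion callable (hypothetical here); it is
    expected to return the judge's critique ending in "Score: N".
    """
    judgement = generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgement)
    # Fall back to the lowest score if the judge output cannot be parsed.
    return float(match.group(1)) if match else 0.0
```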
Iterative DPO training with SRLMs, where each round generates new candidate responses, self-scores them, and trains on the resulting preference pairs, yields gains in both instruction-following ability and the quality of the self-assigned rewards.
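The sketch below shows the data-construction half of one such iteration under simple assumptions: sample several candidates per prompt, score them with the model's own judge (for example, the `self_reward` sketch above), and keep the best-vs-worst pair as DPO preference data. The `generate` and `reward` callables and the `PreferencePair` type are hypothetical; the DPO update itself is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # highest-scoring candidate under the model's own judge
    rejected: str  # lowest-scoring candidate


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n_samples: int = 4,
) -> List[PreferencePair]:
    """One self-rewarding data-construction pass (hypothetical helpers).

    For each prompt: sample `n_samples` candidate responses, score each
    with the model's own judge, and keep the best-vs-worst pair as a DPO
    training example. The DPO update on these pairs (e.g. with a library
    such as TRL's DPOTrainer) is a separate step.
    """
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        scores = [reward(prompt, c) for c in candidates]
        best = max(range(n_samples), key=scores.__getitem__)
        worst = min(range(n_samples), key=scores.__getitem__)
        # Prompts where every candidate ties carry no preference signal.
        if scores[best] > scores[worst]:
            pairs.append(PreferencePair(prompt, candidates[best], candidates[worst]))
    return pairs
```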
Fine-tuning Llama 2 70B over three such iterations produces a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.
SRLMs therefore point toward models that can continually improve both at following instructions and at providing their own reward signal.