Self-Rewarding Language Models (SRLMs) use an LLM's own judgments as the feedback signal during training; because that signal is not capped by human raters, the approach could in principle open a path to superhuman performance.
Areas of application
SRLMs introduce an approach to language model training in which the LLM acts as its own judge, assigning rewards to its own outputs via LLM-as-a-Judge prompting.
This removes the need for a separate, frozen reward model, which is typically bottlenecked by the quality of human preference data and cannot improve alongside the LLM it is training.
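To make the self-judging step concrete, here is a minimal sketch of how a model can grade its own response against a 0-5 rubric and have the score parsed out of the judgement text. The prompt wording and the `generate` callable are illustrative assumptions, not the paper's exact template or API.

```python
import re
from typing import Callable

# Illustrative rubric-style judge prompt; the wording is an assumption,
# not the exact template from the Self-Rewarding Language Models paper.
JUDGE_TEMPLATE = """Review the user's question and the corresponding response.
Award points additively (0-5) for relevance, coverage, helpfulness,
clarity, and expert quality. Conclude with the line "Score: <total points>".

Question: {prompt}
Response: {response}
"""


def self_reward(prompt: str, response: str, generate: Callable[[str], str]) -> float:
    """Score a response by asking the same model to judge it.

    `generate` is any text-completion callable (hypothetical here); it is
    expected to return the judge's critique ending in "Score: N".
    """
    judgement = generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgement)
    # Fall back to the lowest score if the judge output cannot be parsed.
    return float(match.group(1)) if match else 0.0
```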
Iterative DPO training with SRLMs, where each round generates new candidate responses, self-scores them, and trains on the resulting preference pairs, yields gains in both instruction-following ability and the quality of the self-assigned rewards.
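The sketch below shows the data-construction half of one such iteration under simple assumptions: sample several candidates per prompt, score them with the model's own judge (for example, the `self_reward` sketch above), and keep the best-vs-worst pair as DPO preference data. The `generate` and `reward` callables and the `PreferencePair` type are hypothetical; the DPO update itself is not shown.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # highest-scoring candidate under the model's own judge
    rejected: str  # lowest-scoring candidate


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n_samples: int = 4,
) -> List[PreferencePair]:
    """One self-rewarding data-construction pass (hypothetical helpers).

    For each prompt: sample `n_samples` candidate responses, score each
    with the model's own judge, and keep the best-vs-worst pair as a DPO
    training example. The DPO update on these pairs (e.g. with a library
    such as TRL's DPOTrainer) is a separate step.
    """
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        scores = [reward(prompt, c) for c in candidates]
        best = max(range(n_samples), key=scores.__getitem__)
        worst = min(range(n_samples), key=scores.__getitem__)
        # Prompts where every candidate ties carry no preference signal.
        if scores[best] > scores[worst]:
            pairs.append(PreferencePair(prompt, candidates[best], candidates[worst]))
    return pairs
```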
Fine-tuning Llama 2 70B over three such iterations produces a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.
SRLMs therefore point toward models that can continually improve both at following instructions and at providing their own reward signal.