Self-Rewarding Language Models

Self-Rewarding Language Models (SRLMs) utilize an LLM’s own judgment to provide feedback during training, potentially enabling superhuman performance.

Areas of application

    • SRLMs introduce a novel approach to language model training in which the LLM itself acts as a judge, assigning rewards to its own outputs via LLM-as-a-Judge prompting.

    • This removes the need for a separate, frozen reward model, which is bottlenecked by the human preference data it was trained on and cannot improve as the LLM improves.

    • Iterative DPO training with SRLMs yields clear gains in both instruction-following ability and the quality of the rewards the model assigns to itself.

    • Fine-tuning Llama 2 70B with three iterations of self-rewarding training produces a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.

    • SRLMs hold promise for developing models that can continuously self-improve in both instruction following and reward provision.

Example
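
The iterative loop described above can be summarized in a short Python sketch: the current model samples several candidate responses per prompt, scores them with its own LLM-as-a-Judge prompt, keeps the highest- and lowest-scored candidates as a preference pair, and is then fine-tuned on those pairs with DPO. The model interface (`generate`, `judge`) and the `dpo_train` step are hypothetical placeholders for illustration only, not the paper's actual code or any real library API.

# Minimal sketch of one self-rewarding iteration, assuming a hypothetical
# `model` object with `generate(prompt, n)` and `judge(prompt)` methods and a
# hypothetical `dpo_train` step; none of these names come from the paper.

def judge_score(model, prompt, response):
    """Have the model grade its own response (LLM-as-a-Judge style, 0-5 rubric)."""
    judge_prompt = (
        "Review the user's question and the candidate response, then award "
        "a score from 0 to 5 using an additive rubric.\n\n"
        f"Question: {prompt}\nResponse: {response}\nScore:"
    )
    return float(model.judge(judge_prompt))  # placeholder call returning a numeric score


def build_preference_pairs(model, prompts, n_samples=4):
    """Sample candidates, self-score them, and keep best vs. worst as a DPO pair."""
    pairs = []
    for prompt in prompts:
        candidates = model.generate(prompt, n=n_samples)  # placeholder sampling call
        scored = sorted((judge_score(model, prompt, c), c) for c in candidates)
        if scored[-1][0] > scored[0][0]:  # skip ties: no usable preference signal
            pairs.append({
                "prompt": prompt,
                "chosen": scored[-1][1],   # highest self-assigned score
                "rejected": scored[0][1],  # lowest self-assigned score
            })
    return pairs


def dpo_train(model, pairs):
    """Placeholder for a Direct Preference Optimization fine-tuning step."""
    raise NotImplementedError


def self_rewarding_iteration(model, prompts):
    """One iteration: build self-judged preference data, then DPO-train on it."""
    pairs = build_preference_pairs(model, prompts)
    return dpo_train(model, pairs)

# Repeating this loop (M1 -> M2 -> M3) is the iterative training the bullets
# above describe: each new model both follows instructions better and assigns
# better rewards for the next iteration's training data.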