This performance is attributed to its training methodology, which emphasizes data quality and incorporates synthetic data throughout training. Advances in post-training techniques further enhance its capabilities, enabling Phi-4 to achieve high-quality results efficiently.
Phi-4 is part of Microsoft’s Phi family of small language models, focusing on achieving high-quality results despite its smaller size. It is available on platforms like Azure AI Foundry and Hugging Face, making it accessible for various applications that require advanced reasoning capabilities.
Phi-4 outperforms comparable models like Phi-3, Qwen 2.5 (14b instruct), and GPT 4o-mini on most of the following benchmarks:
Phi-4 also achieves competitive or superior performance against larger models like Llama-3.3 (70b instruct), Qwen 2.5 (72b instruct), and GPT 4o:
This showcases Phi-4 as a highly optimized model offering competitive performance while remaining cost-effective.
| Benchmark | Phi-4 14b | Phi-3 14b | Qwen 2.5 14b instruct | GPT 4o-mini | Llama-3.3 70b instruct | Qwen 2.5 72b instruct | GPT 4o |
|---|---|---|---|---|---|---|---|
| MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 |
| GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
| MATH | 80.4 | 44.6 | 75.6 | 73.0 | 66.3 | 80.0 | 74.6 |
| HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 80.4 | 90.6 |
| MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 87.3 | 90.4 |
| SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 |
| DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 |
| MMLUPro | 70.4 | 51.3 | 63.2 | 63.4 | 64.4 | 69.6 | 73.0 |
| HumanEval+ | 82.8 | 69.2 | 79.1 | 82.0 | 77.9 | 78.4 | 88.0 |
| ArenaHard | 75.4 | 45.8 | 70.2 | 76.2 | 65.5 | 78.4 | 75.6 |
| LiveBench | 47.6 | 28.1 | 46.6 | 48.1 | 57.6 | 55.3 | 57.6 |
| IFEval | 63.0 | 57.9 | 78.7 | 80.0 | 89.3 | 85.0 | 84.8 |
| PhiBench (internal) | 56.2 | 43.9 | 49.8 | 58.7 | 57.1 | 64.6 | 72.4 |
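To make the generation-over-generation comparison concrete, a short script can aggregate the scores from the table above. The snippet below (a sketch; the dictionary is just the Phi-4 and Phi-3 columns transcribed by hand) counts how many benchmarks Phi-4 wins against Phi-3 and computes the mean score delta:

```python
# Phi-4 14b vs. Phi-3 14b scores, transcribed from the benchmark table above.
scores = {
    "MMLU": (84.8, 77.9), "GPQA": (56.1, 31.2), "MATH": (80.4, 44.6),
    "HumanEval": (82.6, 67.8), "MGSM": (80.6, 53.5), "SimpleQA": (3.0, 7.6),
    "DROP": (75.5, 68.3), "MMLUPro": (70.4, 51.3), "HumanEval+": (82.8, 69.2),
    "ArenaHard": (75.4, 45.8), "LiveBench": (47.6, 28.1), "IFEval": (63.0, 57.9),
    "PhiBench": (56.2, 43.9),
}

# Count benchmarks where Phi-4 scores higher than Phi-3.
wins = sum(1 for phi4, phi3 in scores.values() if phi4 > phi3)

# Mean improvement across all thirteen benchmarks.
avg_delta = sum(phi4 - phi3 for phi4, phi3 in scores.values()) / len(scores)

print(f"Phi-4 beats Phi-3 on {wins}/{len(scores)} benchmarks")
print(f"Mean score delta: {avg_delta:+.1f} points")
# → Phi-4 beats Phi-3 on 12/13 benchmarks
# → Mean score delta: +16.3 points
```

The lone exception is SimpleQA, where Phi-3 scores higher, which is why "outperforms on most benchmarks" is the accurate framing rather than a clean sweep.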