Phi-4 AI model for STEM reasoning


Phi-4 is a 14-billion-parameter language model from Microsoft Research, developed around a central focus on data quality. It strategically incorporates synthetic data throughout training to strengthen reasoning and problem-solving, and on STEM-focused benchmarks such as GPQA and MATH it surpasses both its predecessor, Phi-3, and its teacher model, GPT-4o.

This performance is attributed to that data-centric training methodology together with advances in post-training techniques, which enable Phi-4 to achieve high-quality results efficiently despite its modest size.

Phi-4 is part of Microsoft’s Phi family of small language models, which aim for high-quality results despite modest parameter counts. It is available on platforms such as Azure AI Foundry and Hugging Face, making it accessible for applications that require advanced reasoning capabilities.
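For orientation, here is a minimal sketch of loading Phi-4 through the Hugging Face `transformers` library. It assumes the model id `microsoft/phi-4`, the `transformers` and `accelerate` packages, and hardware with enough memory for a 14B model in bfloat16; treat it as a starting point, not an official recipe.

```python
# Minimal sketch: loading Phi-4 via Hugging Face transformers.
# Assumes the model id "microsoft/phi-4"; adjust to your environment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 14B parameters: bf16 keeps memory manageable
    device_map="auto",           # requires the accelerate package
)

messages = [
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```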

  • Strengths:
    • Excels in STEM-focused tasks, including advanced math and graduate-level Q&A.
    • High performance on HumanEval coding benchmarks, surpassing larger open-weight models such as Llama-3.3 70b and Qwen 2.5 72b.
    • Competitive with state-of-the-art models like GPT-4o and Qwen-2.5 on reasoning tasks.
  • Limitations:
    • Struggles with strict instruction following, particularly for highly formatted outputs (reflected in its relatively low IFEval score in the table below; a prompting mitigation is sketched after this list).
    • Occasional factual hallucinations, especially on less common factual queries (see its low SimpleQA score).
    • Can produce unnecessarily verbose answers to simple queries, a side effect of its chain-of-thought-focused training.
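Given the instruction-following caveat above, it often helps to state the output contract explicitly and validate the result in code. The snippet below is a hypothetical mitigation pattern, not a documented Phi-4 feature; generation itself would proceed as in the loading sketch above.

```python
# Hypothetical mitigation for the formatting/verbosity limitations above:
# pin down the exact output contract in the system message, then validate.
import json

messages = [
    {"role": "system", "content": 'Answer ONLY with a JSON object of the form '
                                  '{"answer": <number>}. No explanation, no extra text.'},
    {"role": "user", "content": "What is 17 * 23?"},
]

# ...generate with the model as in the loading sketch, then:
raw = '{"answer": 391}'   # placeholder for the model's decoded output
result = json.loads(raw)  # raises ValueError if the model drifted from the format
assert isinstance(result["answer"], (int, float)), "unexpected payload type"
```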

Comparison 

Sourced on: January 9, 2025

Benchmarks Where Phi-4 Outperforms Smaller Models:

Phi-4 outperforms smaller or similarly sized models like Phi-3, Qwen 2.5 (14b instruct), and GPT 4o-mini on most of the following benchmarks; the exceptions are noted inline:

  1. MMLU (84.8): Strong performance compared to Phi-3 (77.9) and Qwen 2.5 (79.9), demonstrating superior reasoning across a wide range of topics.
  2. GPQA (56.1): A significant lead over Phi-3 (31.2) and GPT 4o-mini (40.9), showing Phi-4’s dominance in STEM graduate-level questions.
  3. MATH (80.4): Exceptional performance compared to Phi-3 (44.6) and Qwen 2.5 (75.6), highlighting its capability in math competition-style reasoning.
  4. HumanEval (82.6): Surpasses Phi-3 (67.8) and Qwen 2.5 (72.1) in coding tasks, although GPT 4o-mini (86.2) scores higher here.
  5. MGSM (80.6): Outperforms Phi-3 (53.5) and Qwen 2.5 (79.6) on multilingual grade-school math, though it trails GPT 4o-mini (86.5).
  6. DROP (75.5): A strong lead over Phi-3 (68.3), demonstrating solid reading comprehension and discrete reasoning over paragraphs, though Qwen 2.5 (85.5) leads on this benchmark.

Benchmarks Where Phi-4 Outperforms Larger Models:

Phi-4 also achieves competitive or superior performance against larger models like Llama-3.3 (70b instruct), Qwen 2.5 (72b instruct), and GPT 4o (the short script after this list recomputes these comparisons from the table below):

  1. GPQA (56.1): Outperforms larger models like Llama-3.3 (49.1) and Qwen 2.5 (49.0), showcasing its efficiency in STEM-focused reasoning relative to its size.
  2. MATH (80.4): Performs better than Llama-3.3 (66.3), highlighting its ability to handle complex math problems effectively.
  3. PhiBench (56.2): On Microsoft’s internal evaluation suite, Phi-4 stays within a point of Llama-3.3 (57.1) at a fifth of the parameter count, though it trails Qwen 2.5 (64.6) and GPT 4o (72.4).
  4. HumanEval+ (82.8): Exceeds Llama-3.3 (77.9), reflecting its strength in advanced coding tasks and evaluations.
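As a sanity check, this short script recomputes which of the larger models Phi-4 actually leads on each of these benchmarks, using scores transcribed from the comparison table below (model names abbreviated):

```python
# Recompute Phi-4's head-to-head wins against the larger models,
# using scores transcribed from the comparison table in this post.
scores = {
    "GPQA":       {"Phi-4": 56.1, "Llama-3.3 70b": 49.1, "Qwen 2.5 72b": 49.0, "GPT 4o": 50.6},
    "MATH":       {"Phi-4": 80.4, "Llama-3.3 70b": 66.3, "Qwen 2.5 72b": 80.0, "GPT 4o": 74.6},
    "HumanEval+": {"Phi-4": 82.8, "Llama-3.3 70b": 77.9, "Qwen 2.5 72b": 78.4, "GPT 4o": 88.0},
    "PhiBench":   {"Phi-4": 56.2, "Llama-3.3 70b": 57.1, "Qwen 2.5 72b": 64.6, "GPT 4o": 72.4},
}

for bench, row in scores.items():
    phi = row["Phi-4"]
    beaten = [m for m, s in row.items() if m != "Phi-4" and phi > s]
    print(f"{bench}: Phi-4 ({phi}) leads {', '.join(beaten) or 'none of the larger models'}")
```

Running it confirms the pattern above: clean sweeps on GPQA and MATH, a win over the open-weight models on HumanEval+, and a near-miss on PhiBench.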

Key Highlights:

  • Phi-4’s ability to outperform both smaller and larger models in STEM benchmarks like GPQA and MATH reflects its specialized training with high-quality synthetic data and reasoning-focused datasets.
  • Its strength on coding tasks (e.g., HumanEval, HumanEval+) demonstrates well-rounded capabilities, rivaling even much larger models (the pass@k metric behind these benchmarks is sketched after the table).
  • While it is smaller in size, Phi-4 leverages innovative data and training techniques to achieve performance on par with or better than its larger counterparts in reasoning-heavy benchmarks.

This showcases Phi-4 as a highly optimized model offering competitive performance while remaining cost-effective.

| Benchmark | Phi-4 14b | Phi-3 14b | Qwen 2.5 14b instruct | GPT 4o-mini | Llama-3.3 70b instruct | Qwen 2.5 72b instruct | GPT 4o |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 |
| GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
| MATH | 80.4 | 44.6 | 75.6 | 73.0 | 66.3 | 80.0 | 74.6 |
| HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 80.4 | 90.6 |
| MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 87.3 | 90.4 |
| SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 |
| DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 |
| MMLUPro | 70.4 | 51.3 | 63.2 | 63.4 | 64.4 | 69.6 | 73.0 |
| HumanEval+ | 82.8 | 69.2 | 79.1 | 82.0 | 77.9 | 78.4 | 88.0 |
| ArenaHard | 75.4 | 45.8 | 70.2 | 76.2 | 65.5 | 78.4 | 75.6 |
| LiveBench | 47.6 | 28.1 | 46.6 | 48.1 | 57.6 | 55.3 | 57.6 |
| IFEval | 63.0 | 57.9 | 78.7 | 80.0 | 89.3 | 85.0 | 84.8 |
| PhiBench (internal) | 56.2 | 43.9 | 49.8 | 58.7 | 57.1 | 64.6 | 72.4 |
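A note on the HumanEval and HumanEval+ rows: these coding benchmarks report pass@k, the probability that at least one of k sampled completions passes the unit tests (the scores above are presumably pass@1). The standard unbiased estimator comes from the original HumanEval paper (Chen et al., 2021) and is model-agnostic:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: total completions sampled per problem
    c: completions that pass the unit tests
    k: sampling budget being evaluated
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 53 pass -> pass@1 = 53/200 = 0.265
print(pass_at_k(n=200, c=53, k=1))
```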

Team 

Team Contributions:

  1. Core Development:
    • The team designed and implemented the Phi-4 architecture, a decoder-only transformer with 14 billion parameters, and developed post-training techniques including Direct Preference Optimization (DPO) and the novel Pivotal Token Search (PTS) method (a generic sketch of the DPO objective follows this list).
    • They enhanced the model’s reasoning and problem-solving capabilities by incorporating synthetic data throughout the pretraining and midtraining phases.
  2. Data Engineering:
    • Specialists curated high-quality organic data and developed pipelines to generate diverse and nuanced synthetic datasets to emphasize reasoning and problem-solving.
  3. Post-Training and Fine-Tuning:
    • Efforts were focused on supervised fine-tuning, DPO, and mitigation of hallucination issues, ensuring the model adhered to safety standards and user preferences.
  4. Safety and Ethical Alignment:
    • The model’s alignment with Microsoft’s Responsible AI principles was overseen by ethical AI experts, who performed extensive safety checks, red-teaming, and evaluations to reduce risks related to harmful or biased outputs.
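For readers unfamiliar with DPO, below is a minimal sketch of the standard objective from Rafailov et al. (2023), which preference-tuning stages like Phi-4’s build on. It is a generic illustration rather than Microsoft’s implementation; the beta value and function signature are illustrative.

```python
# Generic sketch of the Direct Preference Optimization (DPO) loss
# (Rafailov et al., 2023). Not Microsoft's Phi-4 implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a batch of sequence log-probabilities log p(y|x)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to widen the margin between preferred and dispreferred
    # completions; beta controls how far it may drift from the reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

In Phi-4’s pipeline, Pivotal Token Search is reported to construct the preference pairs fed into an objective of this kind, targeting the individual tokens that most change a solution’s probability of success.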
