Phi-4 AI model for STEM reasoning


Phi-4 is a 14-billion-parameter language model from Microsoft Research, developed around a central focus on data quality. It strategically incorporates synthetic data throughout training to strengthen reasoning and problem-solving, and on STEM-focused benchmarks such as GPQA and MATH it surpasses both its predecessor, Phi-3, and its teacher model, GPT-4o.

This performance is attributed to that data-centric training methodology together with advances in post-training techniques, which enable Phi-4 to achieve high-quality results efficiently despite its modest size.

Phi-4 is part of Microsoft’s Phi family of small language models, which aim for high-quality results despite modest parameter counts. It is available on platforms such as Azure AI Foundry and Hugging Face, making it accessible for applications that require advanced reasoning capabilities.
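For orientation, here is a minimal sketch of loading Phi-4 through the Hugging Face `transformers` library. It assumes the model id `microsoft/phi-4`, the `transformers` and `accelerate` packages, and hardware with enough memory for a 14B model in bfloat16; treat it as a starting point, not an official recipe.

```python
# Minimal sketch: loading Phi-4 via Hugging Face transformers.
# Assumes the model id "microsoft/phi-4"; adjust to your environment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 14B parameters: bf16 keeps memory manageable
    device_map="auto",           # requires the accelerate package
)

messages = [
    {"role": "user", "content": "A train travels 120 km in 1.5 hours. What is its average speed?"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```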

  • Strengths:
    • Excels in STEM-focused tasks, including advanced math and graduate-level Q&A.
    • High performance on HumanEval coding benchmarks, surpassing larger open-weight models such as Llama-3.3 70b and Qwen 2.5 72b.
    • Competitive with state-of-the-art models like GPT-4o and Qwen-2.5 on reasoning tasks.
  • Limitations:
    • Struggles with strict instruction following, particularly for highly formatted outputs (reflected in its relatively low IFEval score in the table below; a prompting mitigation is sketched after this list).
    • Occasional factual hallucinations, especially on less common factual queries (see its low SimpleQA score).
    • Can produce unnecessarily verbose answers to simple queries, a side effect of its chain-of-thought-focused training.
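Given the instruction-following caveat above, it often helps to state the output contract explicitly and validate the result in code. The snippet below is a hypothetical mitigation pattern, not a documented Phi-4 feature; generation itself would proceed as in the loading sketch above.

```python
# Hypothetical mitigation for the formatting/verbosity limitations above:
# pin down the exact output contract in the system message, then validate.
import json

messages = [
    {"role": "system", "content": 'Answer ONLY with a JSON object of the form '
                                  '{"answer": <number>}. No explanation, no extra text.'},
    {"role": "user", "content": "What is 17 * 23?"},
]

# ...generate with the model as in the loading sketch, then:
raw = '{"answer": 391}'   # placeholder for the model's decoded output
result = json.loads(raw)  # raises ValueError if the model drifted from the format
assert isinstance(result["answer"], (int, float)), "unexpected payload type"
```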

Comparison 

Sourced on: January 9, 2025

Benchmarks Where Phi-4 Outperforms Smaller Models:

Phi-4 outperforms smaller or similarly sized models like Phi-3, Qwen 2.5 (14b instruct), and GPT 4o-mini on most of the following benchmarks; the exceptions are noted inline:

  1. MMLU (84.8): Strong performance compared to Phi-3 (77.9) and Qwen 2.5 (79.9), demonstrating superior reasoning across a wide range of topics.
  2. GPQA (56.1): A significant lead over Phi-3 (31.2) and GPT 4o-mini (40.9), showing Phi-4’s dominance in STEM graduate-level questions.
  3. MATH (80.4): Exceptional performance compared to Phi-3 (44.6) and Qwen 2.5 (75.6), highlighting its capability in math competition-style reasoning.
  4. HumanEval (82.6): Surpasses Phi-3 (67.8) and Qwen 2.5 (72.1) in coding tasks, although GPT 4o-mini (86.2) scores higher here.
  5. MGSM (80.6): Outperforms Phi-3 (53.5) and Qwen 2.5 (79.6) on multilingual grade-school math, though it trails GPT 4o-mini (86.5).
  6. DROP (75.5): A strong lead over Phi-3 (68.3), demonstrating solid reading comprehension and discrete reasoning over paragraphs, though Qwen 2.5 (85.5) leads on this benchmark.

Benchmarks Where Phi-4 Outperforms Larger Models:

Phi-4 also achieves competitive or superior performance against larger models like Llama-3.3 (70b instruct), Qwen 2.5 (72b instruct), and GPT 4o (the short script after this list recomputes these comparisons from the table below):

  1. GPQA (56.1): Outperforms larger models like Llama-3.3 (49.1) and Qwen 2.5 (49.0), showcasing its efficiency in STEM-focused reasoning relative to its size.
  2. MATH (80.4): Performs better than Llama-3.3 (66.3), highlighting its ability to handle complex math problems effectively.
  3. PhiBench (56.2): On Microsoft’s internal evaluation suite, Phi-4 stays within a point of Llama-3.3 (57.1) at a fifth of the parameter count, though it trails Qwen 2.5 (64.6) and GPT 4o (72.4).
  4. HumanEval+ (82.8): Exceeds Llama-3.3 (77.9), reflecting its strength in advanced coding tasks and evaluations.
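As a sanity check, this short script recomputes which of the larger models Phi-4 actually leads on each of these benchmarks, using scores transcribed from the comparison table below (model names abbreviated):

```python
# Recompute Phi-4's head-to-head wins against the larger models,
# using scores transcribed from the comparison table in this post.
scores = {
    "GPQA":       {"Phi-4": 56.1, "Llama-3.3 70b": 49.1, "Qwen 2.5 72b": 49.0, "GPT 4o": 50.6},
    "MATH":       {"Phi-4": 80.4, "Llama-3.3 70b": 66.3, "Qwen 2.5 72b": 80.0, "GPT 4o": 74.6},
    "HumanEval+": {"Phi-4": 82.8, "Llama-3.3 70b": 77.9, "Qwen 2.5 72b": 78.4, "GPT 4o": 88.0},
    "PhiBench":   {"Phi-4": 56.2, "Llama-3.3 70b": 57.1, "Qwen 2.5 72b": 64.6, "GPT 4o": 72.4},
}

for bench, row in scores.items():
    phi = row["Phi-4"]
    beaten = [m for m, s in row.items() if m != "Phi-4" and phi > s]
    print(f"{bench}: Phi-4 ({phi}) leads {', '.join(beaten) or 'none of the larger models'}")
```

Running it confirms the pattern above: clean sweeps on GPQA and MATH, a win over the open-weight models on HumanEval+, and a near-miss on PhiBench.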

Key Highlights:

  • Phi-4’s ability to outperform both smaller and larger models in STEM benchmarks like GPQA and MATH reflects its specialized training with high-quality synthetic data and reasoning-focused datasets.
  • Its strength on coding tasks (e.g., HumanEval, HumanEval+) demonstrates well-rounded capabilities, rivaling even much larger models (the pass@k metric behind these benchmarks is sketched after the table).
  • While it is smaller in size, Phi-4 leverages innovative data and training techniques to achieve performance on par with or better than its larger counterparts in reasoning-heavy benchmarks.

This showcases Phi-4 as a highly optimized model offering competitive performance while remaining cost-effective.

| Benchmark | Phi-4 14b | Phi-3 14b | Qwen 2.5 14b instruct | GPT 4o-mini | Llama-3.3 70b instruct | Qwen 2.5 72b instruct | GPT 4o |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | 84.8 | 77.9 | 79.9 | 81.8 | 86.3 | 85.3 | 88.1 |
| GPQA | 56.1 | 31.2 | 42.9 | 40.9 | 49.1 | 49.0 | 50.6 |
| MATH | 80.4 | 44.6 | 75.6 | 73.0 | 66.3 | 80.0 | 74.6 |
| HumanEval | 82.6 | 67.8 | 72.1 | 86.2 | 78.9 | 80.4 | 90.6 |
| MGSM | 80.6 | 53.5 | 79.6 | 86.5 | 89.1 | 87.3 | 90.4 |
| SimpleQA | 3.0 | 7.6 | 5.4 | 9.9 | 20.9 | 10.2 | 39.4 |
| DROP | 75.5 | 68.3 | 85.5 | 79.3 | 90.2 | 76.7 | 80.9 |
| MMLUPro | 70.4 | 51.3 | 63.2 | 63.4 | 64.4 | 69.6 | 73.0 |
| HumanEval+ | 82.8 | 69.2 | 79.1 | 82.0 | 77.9 | 78.4 | 88.0 |
| ArenaHard | 75.4 | 45.8 | 70.2 | 76.2 | 65.5 | 78.4 | 75.6 |
| LiveBench | 47.6 | 28.1 | 46.6 | 48.1 | 57.6 | 55.3 | 57.6 |
| IFEval | 63.0 | 57.9 | 78.7 | 80.0 | 89.3 | 85.0 | 84.8 |
| PhiBench (internal) | 56.2 | 43.9 | 49.8 | 58.7 | 57.1 | 64.6 | 72.4 |
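A note on the HumanEval and HumanEval+ rows: these coding benchmarks report pass@k, the probability that at least one of k sampled completions passes the unit tests (the scores above are presumably pass@1). The standard unbiased estimator comes from the original HumanEval paper (Chen et al., 2021) and is model-agnostic:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021).

    n: total completions sampled per problem
    c: completions that pass the unit tests
    k: sampling budget being evaluated
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 53 pass -> pass@1 = 53/200 = 0.265
print(pass_at_k(n=200, c=53, k=1))
```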

Team 

Team Contributions:

  1. Core Development:
    • The team designed and implemented the Phi-4 architecture, a decoder-only transformer with 14 billion parameters, and developed post-training techniques including Direct Preference Optimization (DPO) and the novel Pivotal Token Search (PTS) method (a generic sketch of the DPO objective follows this list).
    • They enhanced the model’s reasoning and problem-solving capabilities by incorporating synthetic data throughout the pretraining and midtraining phases.
  2. Data Engineering:
    • Specialists curated high-quality organic data and developed pipelines to generate diverse and nuanced synthetic datasets to emphasize reasoning and problem-solving.
  3. Post-Training and Fine-Tuning:
    • Efforts were focused on supervised fine-tuning, DPO, and mitigation of hallucination issues, ensuring the model adhered to safety standards and user preferences.
  4. Safety and Ethical Alignment:
    • The model’s alignment with Microsoft’s Responsible AI principles was overseen by ethical AI experts, who performed extensive safety checks, red-teaming, and evaluations to reduce risks related to harmful or biased outputs.
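For readers unfamiliar with DPO, below is a minimal sketch of the standard objective from Rafailov et al. (2023), which preference-tuning stages like Phi-4’s build on. It is a generic illustration rather than Microsoft’s implementation; the beta value and function signature are illustrative.

```python
# Generic sketch of the Direct Preference Optimization (DPO) loss
# (Rafailov et al., 2023). Not Microsoft's Phi-4 implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Each input is a batch of sequence log-probabilities log p(y|x)."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the policy to widen the margin between preferred and dispreferred
    # completions; beta controls how far it may drift from the reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

In Phi-4’s pipeline, Pivotal Token Search is reported to construct the preference pairs fed into an objective of this kind, targeting the individual tokens that most change a solution’s probability of success.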
