Phi-3.5-MoE is designed as a multilingual, multi-task model with 42B total parameters, leveraging a Mixture-of-Experts (MoE) architecture in which only 6.6B parameters are activated per inference. It excels at language reasoning, translation, and multilingual understanding, and is optimized for tasks such as summarization and code generation. The MoE routing assigns different experts to distinct domains, such as STEM and the social sciences, ensuring efficient task handling. The model's safety protocols are aligned with Microsoft's Responsible AI principles, making it reliable for large-scale, ethical AI deployments.
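To make the activated-parameter figure concrete, below is a minimal PyTorch sketch of top-k expert routing, the mechanism that lets an MoE model hold many expert feed-forward blocks while running only a few per token. The class name and layer dimensions are illustrative placeholders; the expert count and top-2 routing mirror what Microsoft reports for Phi-3.5-MoE (16 experts, 2 active per token), but this is not the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Sketch of a routed feed-forward (MoE) layer with top-2 gating."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                # x: (tokens, d_model)
        gate_logits = self.router(x)                     # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, -1)  # best k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize the kept gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens routed to expert e
                if mask.any():                           # only these tokens pay for e
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

layer = Top2MoELayer()
y = layer(torch.randn(4, 512))  # each token touches only 2 of the 16 expert MLPs
```

Because each token activates only its top-2 experts, most expert weights sit idle on any given forward pass, which is how a 42B-parameter model can run with roughly 6.6B activated parameters.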
Phi-3.5-MoE-instruct surpasses larger models such as Mistral-Nemo-12B and Llama-3.1-8B on specific tasks, notably multilingual understanding and reasoning. With 42B parameters in total but only 6.6B active per token, it is highly efficient, particularly in language modeling and understanding. It offers a balance of scalability, performance, and safety, and is cost-effective for commercial use. The model has been benchmarked against several high-performance models, as shown below, with strong results in multilingual and task-specific areas such as summarization and code generation.
Category | Benchmark | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
---|---|---|---|---|---|---|---|
Popular aggregated benchmark | Arena Hard | 37.9 | 39.4 | 25.7 | 42.0 | 55.2 | 75.0 |
Popular aggregated benchmark | BigBench Hard CoT (0-shot) | 79.1 | 60.2 | 63.4 | 63.5 | 66.7 | 80.4 |
Popular aggregated benchmark | MMLU (5-shot) | 78.9 | 67.2 | 68.1 | 71.3 | 78.7 | 77.2 |
Popular aggregated benchmark | MMLU-Pro (0-shot, CoT) | 54.3 | 40.7 | 44.0 | 50.1 | 57.2 | 62.8 |
Reasoning | ARC Challenge (10-shot) | 91.0 | 84.8 | 83.1 | 89.8 | 92.8 | 93.5 |
Reasoning | BoolQ (2-shot) | 84.6 | 82.5 | 82.8 | 85.7 | 85.8 | 88.7 |
Reasoning | GPQA (0-shot, CoT) | 36.8 | 28.6 | 26.3 | 29.2 | 37.5 | 41.1 |
Reasoning | HellaSwag (5-shot) | 83.8 | 76.7 | 73.5 | 80.9 | 67.5 | 87.1 |
Reasoning | OpenBookQA (10-shot) | 89.6 | 84.4 | 84.8 | 89.6 | 89.0 | 90.0 |
Reasoning | PIQA (5-shot) | 88.6 | 83.5 | 81.2 | 83.7 | 87.5 | 88.7 |
Reasoning | Social IQA (5-shot) | 78.0 | 75.3 | 71.8 | 74.7 | 77.8 | 82.9 |
Reasoning | TruthfulQA (MC2) (10-shot) | 77.5 | 68.1 | 69.2 | 76.6 | 76.6 | 78.2 |
Reasoning | WinoGrande (5-shot) | 81.3 | 70.4 | 64.7 | 74.0 | 74.7 | 76.9 |
Multi-lingual | MMLU (5-shot) | 69.9 | 58.9 | 56.2 | 63.8 | 77.2 | 72.9 |
Math | MGSM (0-shot, CoT) | 58.7 | 63.3 | 56.7 | 75.1 | 75.8 | 81.7 |
Math | GSM8K (8-shot, CoT) | 88.7 | 84.2 | 82.4 | 84.9 | 82.4 | 91.3 |
Math | MATH (0-shot, CoT) | 59.5 | 31.2 | 47.6 | 50.9 | 38.0 | 70.2 |
Long context | Qasper | 40.0 | 30.7 | 37.2 | 13.9 | 43.5 | 39.8 |
Long context | SQUALITY | 24.1 | 25.8 | 26.2 | 0.0 | 23.5 | 23.8 |
Code Generation | HumanEval (0-shot) | 70.7 | 63.4 | 66.5 | 61.0 | 74.4 | 86.6 |
Code Generation | MBPP (3-shot) | 80.8 | 68.1 | 69.4 | 69.3 | 77.5 | 84.1 |
Average | Average | 69.2 | 61.3 | 61.0 | 63.3 | 68.5 | 74.9 |
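As a sanity check on the table, the final row appears to be the unweighted mean of the 21 benchmark scores in each column; a quick verification for the Phi-3.5-MoE-instruct column, with the values copied from the rows above:

```python
# Unweighted mean of the 21 Phi-3.5-MoE-instruct scores from the table above.
phi_scores = [37.9, 79.1, 78.9, 54.3, 91.0, 84.6, 36.8, 83.8, 89.6, 88.6,
              78.0, 77.5, 81.3, 69.9, 58.7, 88.7, 59.5, 40.0, 24.1, 70.7, 80.8]
print(round(sum(phi_scores) / len(phi_scores), 1))  # -> 69.2, matching the Average row
```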
The Phi-3.5-MoE model was developed by a large team of AI and responsible machine learning specialists at Microsoft. The team worked across multiple global regions, drawing on Microsoft's AI expertise and focusing on ethical development to ensure robust safety and performance.
Microsoft provides substantial community support for Phi-3.5-MoE, including a dedicated GitHub repository and Azure AI Studio resources. Developers and data scientists can experiment with the model through accessible API deployments, and the community offers ongoing updates and discussion of best practices.
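For quick local experimentation outside Azure, a minimal sketch using Hugging Face `transformers` is shown below. The repo id `microsoft/Phi-3.5-MoE-instruct` and the generation settings are assumptions to verify against the official model card.

```python
# Minimal sketch: loading Phi-3.5-MoE-instruct via Hugging Face transformers.
# The repo id and settings below are assumptions; check the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-MoE-instruct"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # only ~6.6B parameters run per token,
    device_map="auto",           # but all 42B must still fit in memory
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Summarize the benefits of MoE models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that while inference activates only a fraction of the parameters per token, the full 42B-parameter checkpoint must be loaded, so memory requirements are set by the total size, not the active size.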