Phi-3.5-MoE is designed as a multilingual, multi-task model with 42B total parameters, leveraging a Mixture-of-Experts (MoE) architecture in which only 6.6B parameters are activated per inference. It excels at language reasoning, translation, and multilingual understanding, and is optimized for tasks such as summarization and code generation. The MoE routing assigns different experts to distinct domains, such as STEM and the social sciences, ensuring efficient task handling. The model's safety protocols are aligned with Microsoft's Responsible AI principles, making it reliable for large-scale, ethical AI deployments.
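To make the activated-parameter figure concrete, below is a minimal PyTorch sketch of top-k expert routing, the mechanism that lets an MoE model hold many expert feed-forward blocks while running only a few per token. The class name and layer dimensions are illustrative placeholders; the expert count and top-2 routing mirror what Microsoft reports for Phi-3.5-MoE (16 experts, 2 active per token), but this is not the published implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Sketch of a routed feed-forward (MoE) layer with top-2 gating."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                # x: (tokens, d_model)
        gate_logits = self.router(x)                     # (tokens, n_experts)
        weights, idx = gate_logits.topk(self.top_k, -1)  # best k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize the kept gates
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens routed to expert e
                if mask.any():                           # only these tokens pay for e
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

layer = Top2MoELayer()
y = layer(torch.randn(4, 512))  # each token touches only 2 of the 16 expert MLPs
```

Because each token activates only its top-2 experts, most expert weights sit idle on any given forward pass, which is how a 42B-parameter model can run with roughly 6.6B activated parameters.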
Phi-3.5-MoE-instruct surpasses larger models such as Mistral-Nemo-12B and Llama-3.1-8B on specific tasks, notably multilingual understanding and reasoning. With 42B parameters in total but only 6.6B active per token, it is highly efficient, particularly in language modeling and understanding. It offers a balance of scalability, performance, and safety, and is cost-effective for commercial use. The model has been benchmarked against several high-performance models, as shown below, with strong results in multilingual and task-specific areas such as summarization and code generation.
Category | Benchmark | Phi-3.5-MoE-instruct | Mistral-Nemo-12B-instruct-2407 | Llama-3.1-8B-instruct | Gemma-2-9b-It | Gemini-1.5-Flash | GPT-4o-mini-2024-07-18 (Chat) |
---|---|---|---|---|---|---|---|
Popular aggregated benchmark | Arena Hard | 37.9 | 39.4 | 25.7 | 42.0 | 55.2 | 75.0 |
Popular aggregated benchmark | BigBench Hard CoT (0-shot) | 79.1 | 60.2 | 63.4 | 63.5 | 66.7 | 80.4 |
Popular aggregated benchmark | MMLU (5-shot) | 78.9 | 67.2 | 68.1 | 71.3 | 78.7 | 77.2 |
Popular aggregated benchmark | MMLU-Pro (0-shot, CoT) | 54.3 | 40.7 | 44.0 | 50.1 | 57.2 | 62.8 |
Reasoning | ARC Challenge (10-shot) | 91.0 | 84.8 | 83.1 | 89.8 | 92.8 | 93.5 |
Reasoning | BoolQ (2-shot) | 84.6 | 82.5 | 82.8 | 85.7 | 85.8 | 88.7 |
Reasoning | GPQA (0-shot, CoT) | 36.8 | 28.6 | 26.3 | 29.2 | 37.5 | 41.1 |
Reasoning | HellaSwag (5-shot) | 83.8 | 76.7 | 73.5 | 80.9 | 67.5 | 87.1 |
Reasoning | OpenBookQA (10-shot) | 89.6 | 84.4 | 84.8 | 89.6 | 89.0 | 90.0 |
Reasoning | PIQA (5-shot) | 88.6 | 83.5 | 81.2 | 83.7 | 87.5 | 88.7 |
Reasoning | Social IQA (5-shot) | 78.0 | 75.3 | 71.8 | 74.7 | 77.8 | 82.9 |
Reasoning | TruthfulQA (MC2) (10-shot) | 77.5 | 68.1 | 69.2 | 76.6 | 76.6 | 78.2 |
Reasoning | WinoGrande (5-shot) | 81.3 | 70.4 | 64.7 | 74.0 | 74.7 | 76.9 |
Multi-lingual | MMLU (5-shot) | 69.9 | 58.9 | 56.2 | 63.8 | 77.2 | 72.9 |
Math | MGSM (0-shot, CoT) | 58.7 | 63.3 | 56.7 | 75.1 | 75.8 | 81.7 |
Math | GSM8K (8-shot, CoT) | 88.7 | 84.2 | 82.4 | 84.9 | 82.4 | 91.3 |
Math | MATH (0-shot, CoT) | 59.5 | 31.2 | 47.6 | 50.9 | 38.0 | 70.2 |
Long context | Qasper | 40.0 | 30.7 | 37.2 | 13.9 | 43.5 | 39.8 |
Long context | SQUALITY | 24.1 | 25.8 | 26.2 | 0.0 | 23.5 | 23.8 |
Code Generation | HumanEval (0-shot) | 70.7 | 63.4 | 66.5 | 61.0 | 74.4 | 86.6 |
Code Generation | MBPP (3-shot) | 80.8 | 68.1 | 69.4 | 69.3 | 77.5 | 84.1 |
Average | Average | 69.2 | 61.3 | 61.0 | 63.3 | 68.5 | 74.9 |
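As a sanity check on the table, the final row appears to be the unweighted mean of the 21 benchmark scores in each column; a quick verification for the Phi-3.5-MoE-instruct column, with the values copied from the rows above:

```python
# Unweighted mean of the 21 Phi-3.5-MoE-instruct scores from the table above.
phi_scores = [37.9, 79.1, 78.9, 54.3, 91.0, 84.6, 36.8, 83.8, 89.6, 88.6,
              78.0, 77.5, 81.3, 69.9, 58.7, 88.7, 59.5, 40.0, 24.1, 70.7, 80.8]
print(round(sum(phi_scores) / len(phi_scores), 1))  # -> 69.2, matching the Average row
```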
The Phi-3.5-MoE model was developed by a large team of AI and responsible machine learning specialists at Microsoft. The team worked across multiple global regions, drawing on Microsoft's AI expertise and focusing on ethical development to ensure robust safety and performance.
Microsoft provides substantial community support for Phi-3.5-MoE, including a dedicated GitHub repository and Azure AI Studio resources. Developers and data scientists can experiment with the model through accessible API deployments, and the community offers ongoing updates and discussion of best practices.
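For quick local experimentation outside Azure, a minimal sketch using Hugging Face `transformers` is shown below. The repo id `microsoft/Phi-3.5-MoE-instruct` and the generation settings are assumptions to verify against the official model card.

```python
# Minimal sketch: loading Phi-3.5-MoE-instruct via Hugging Face transformers.
# The repo id and settings below are assumptions; check the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3.5-MoE-instruct"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # only ~6.6B parameters run per token,
    device_map="auto",           # but all 42B must still fit in memory
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Summarize the benefits of MoE models."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

Note that while inference activates only a fraction of the parameters per token, the full 42B-parameter checkpoint must be loaded, so memory requirements are set by the total size, not the active size.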