Phi-3-Mini-4K-Instruct is a cutting-edge language model designed for both commercial and research applications. The 3.8-billion-parameter member of the Phi-3 family, it is trained on the Phi-3 datasets, which combine synthetic data with high-quality, filtered web data. Its training emphasizes reasoning density and quality, making it adept at complex language tasks.
The model has undergone rigorous post-training, including supervised fine-tuning (SFT) and direct preference optimization (DPO), to ensure it follows instructions accurately and maintains safety standards. When benchmarked, Phi-3-Mini-4K-Instruct demonstrates state-of-the-art performance in common sense, language understanding, math, code, and logical reasoning among models with fewer than 13 billion parameters.
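For background, direct preference optimization trains a model directly on pairs of preferred and rejected responses instead of first fitting a separate reward model. Below is a minimal sketch of the DPO objective (Rafailov et al., 2023) in PyTorch; it illustrates the loss only and is not Microsoft’s actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss sketch. Each argument is the summed log-probability a model
    assigns to a full response; 'chosen' responses are human-preferred."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to widen the margin between chosen and rejected
    # responses relative to the frozen reference model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.5]))
print(loss.item())
```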
Developers should note that while the model is versatile, it is not tailored for all possible use cases. It is crucial to evaluate the model for accuracy, safety, and fairness within the specific context of its application, particularly in high-risk scenarios. Additionally, developers must comply with relevant laws and regulations, including those related to privacy and trade compliance.
Phi-3-Mini-4K-Instruct is integrated into the development version of the transformers library and is also available on HuggingChat. It supports a vocabulary size of up to 32,064 tokens and is optimized for chat-format prompts. The model is licensed under the MIT license; the project may reference various Microsoft trademarks, whose use must align with Microsoft’s Trademark & Brand Guidelines.
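As a starting point, here is a minimal sketch of loading the model with transformers and prompting it through its chat template; the generation settings are illustrative, and `trust_remote_code=True` is only needed on transformers versions that predate native Phi-3 support.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # half-precision weights to cut memory use
    device_map="auto",            # spreads layers across available GPUs
    trust_remote_code=True,       # only needed before native Phi-3 support
)

# The tokenizer ships a chat template that wraps messages in the
# <|user|> ... <|end|> <|assistant|> markers the model expects.
messages = [
    {"role": "user", "content": "Summarize what 'instruction tuning' means."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```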
For those looking to deploy the model, it is compatible with multi-GPU setups; note that it uses flash attention by default, which requires recent GPU hardware (e.g., NVIDIA A100, A6000, or H100). It also supports ONNX Runtime across various platforms and hardware, ensuring broad accessibility and optimization for different devices.
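For the ONNX route, the onnxruntime-genai package provides a token-by-token generation loop over an ONNX export of the model. A hedged sketch follows, based on the early (0.2-era) API, which has since evolved; the model directory path is a placeholder for wherever the ONNX export was downloaded.

```python
import onnxruntime_genai as og

# Placeholder path: point this at a downloaded ONNX export of Phi-3-mini.
model = og.Model("./phi3-mini-4k-instruct-onnx")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

# Phi-3 chat-format prompt markers.
prompt = "<|user|>\nWhat is ONNX Runtime?<|end|>\n<|assistant|>"

params = og.GeneratorParams(model)
params.set_search_options(max_length=512)
params.input_ids = tokenizer.encode(prompt)

generator = og.Generator(model, params)
# Stream tokens to stdout as they are produced.
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end="", flush=True)
```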
The Phi-3-Mini-4K-Instruct model, despite having only 3.8 billion parameters, demonstrates remarkable performance across various benchmarks, often outperforming larger models. Here are the key highlights:
1) MMLU (5-Shot): Phi-3-Mini-4K-Instruct scores 68.8, within a few points of GPT-3.5’s 71.4 despite being a far smaller model.
2) HellaSwag (5-Shot): It achieves 76.7, competitive with GPT-3.5’s 78.8 and well ahead of Mistral 7b’s 58.5.
3) GSM-8K (0-Shot; CoT): The model excels with 82.5, far outstripping Mistral’s 46.4 and even surpassing GPT-3.5’s 78.1.
4) TriviaQA (5-Shot): At 64.0, the model trails knowledge-heavy competitors such as Gemma 7b (75.2) and Mixtral (82.2), reflecting the limited capacity of a 3.8-billion-parameter model to store factual knowledge.
Overall, Phi-3-Mini-4K-Instruct’s performance is impressive, especially considering its smaller size relative to other models. It showcases the efficiency of its design and training, making it a robust choice for various applications.
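For readers unfamiliar with the shot notation used above and in the table below: “k-shot” means k worked examples are prepended to the test question, and “CoT” means the prompt elicits step-by-step reasoning. A toy sketch of assembling a k-shot prompt (a hypothetical helper, not the actual evaluation harness):

```python
# Toy examples; real benchmarks draw shots from their own training splits.
EXAMPLES = [
    ("What is 2 + 2?", "4"),
    ("What is 7 * 6?", "42"),
    ("What is 15 / 3?", "5"),
]

def build_k_shot_prompt(question: str, k: int) -> str:
    """Prepend k worked Q/A examples before the real question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in EXAMPLES[:k])
    return f"{shots}\n\nQ: {question}\nA:"

print(build_k_shot_prompt("What is 9 - 3?", k=2))
```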
Benchmark | Phi-3-Mini-4K-In 3.8b | Phi-3-Small 7b (preview) | Phi-3-Medium 14b (preview) | Phi-2 2.7b | Mistral 7b | Gemma 7b | Llama-3-In 8b | Mixtral 8x7b | GPT-3.5 version 1106 |
---|---|---|---|---|---|---|---|---|---|
MMLU 5-Shot | 68.8 | 75.3 | 78.2 | 56.3 | 61.7 | 63.6 | 66.5 | 68.4 | 71.4 |
HellaSwag 5-Shot | 76.7 | 78.7 | 83.2 | 53.6 | 58.5 | 49.8 | 71.1 | 70.4 | 78.8 |
ANLI 7-Shot | 52.8 | 55 | 58.7 | 42.5 | 47.1 | 48.7 | 57.3 | 55.2 | 58.1 |
GSM-8K 0-Shot; CoT | 82.5 | 86.4 | 90.8 | 61.1 | 46.4 | 59.8 | 77.4 | 64.7 | 78.1 |
MedQA 2-Shot | 53.8 | 58.2 | 69.8 | 40.9 | 49.6 | 50 | 60.5 | 62.2 | 63.4 |
AGIEval 0-Shot | 37.5 | 45 | 49.7 | 29.8 | 35.1 | 42.1 | 42 | 45.2 | 48.4 |
TriviaQA 5-Shot | 64 | 59.1 | 73.3 | 45.2 | 72.3 | 75.2 | 67.7 | 82.2 | 85.8 |
Arc-C 10-Shot | 84.9 | 90.7 | 91.9 | 75.9 | 78.6 | 78.3 | 82.8 | 87.3 | 87.4 |
Arc-E 10-Shot | 94.6 | 97.1 | 98 | 88.5 | 90.6 | 91.4 | 93.4 | 95.6 | 96.3 |
PIQA 5-Shot | 84.2 | 87.8 | 88.2 | 60.2 | 77.7 | 78.1 | 75.7 | 86 | 86.6 |
SociQA 5-Shot | 76.6 | 79 | 79.4 | 68.3 | 74.6 | 65.5 | 73.9 | 75.9 | 68.3 |
BigBench-Hard 0-Shot | 71.7 | 75 | 82.5 | 59.4 | 57.3 | 59.6 | 51.5 | 69.7 | 68.32 |
WinoGrande 5-Shot | 70.8 | 82.5 | 81.2 | 54.7 | 54.2 | 55.6 | 65 | 62 | 68.8 |
OpenBookQA 10-Shot | 83.2 | 88.4 | 86.6 | 73.6 | 79.8 | 78.6 | 82.6 | 85.8 | 86 |
BoolQ 0-Shot | 77.6 | 82.9 | 86.5 | -- | 72.2 | 66 | 80.9 | 77.6 | 79.1 |
CommonSenseQA 10-Shot | 80.2 | 80.3 | 82.6 | 69.3 | 72.6 | 76.2 | 79 | 78.1 | 79.6 |
TruthfulQA 10-Shot | 65 | 68.1 | 74.8 | -- | 52.1 | 53 | 63.2 | 60.1 | 85.8 |
HumanEval 0-Shot | 59.1 | 59.1 | 54.7 | 47 | 28 | 34.1 | 60.4 | 37.8 | 62.2 |
MBPP 3-Shot | 53.8 | 71.4 | 73.7 | 60.6 | 50.8 | 51.5 | 67.7 | 60.2 | 77.8 |
The team behind this model is Microsoft, a verified organization with a strong presence in AI and ML research. The team, comprising 1405 members, has contributed to various projects, including state-of-the-art models and frameworks. One notable contribution is the SpeechT5 framework, which addresses multiple audio-related tasks through a unified seq2seq model complemented by modal-specific pre-nets and post-nets. Another significant project is TAPEX, a pre-training approach for table-based question answering and fact verification, showcasing expertise in handling structured data. The team’s work reflects a commitment to advancing machine learning, particularly natural language processing and speech synthesis, as evidenced by its extensive research and model updates. These collaborative efforts have produced a collection of models and datasets that serve as valuable resources for the broader AI community.