The Phi-3-Mini-128K-Instruct is a cutting-edge language model designed for both commercial and research applications. Part of the Phi-3 family, it combines a lightweight 3.8 billion-parameter footprint with a 128K-token context window, making it a powerful tool for a wide range of AI-powered features. The model was trained on a combination of synthetic data and filtered, high-quality web data, with a focus on reasoning and comprehension.
After pretraining, the model underwent supervised fine-tuning and preference optimization to strengthen instruction following and safety. Its performance is state-of-the-art for its size, especially in common-sense reasoning, language understanding, and logical reasoning, where it competes well with models of up to 13 billion parameters.
Developers should note that while the model excels in English language tasks, it’s not tailored for all scenarios. Accuracy, safety, and fairness evaluations are crucial, particularly in high-risk situations. Compliance with relevant laws and regulations is also essential.
The model integrates with the Hugging Face transformers library and uses a vocabulary of 32,064 tokens. It is optimized for chat-format prompts and can be run on supported GPU hardware or, via ONNX, across a variety of platforms; a minimal usage sketch follows.
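The sketch below assumes the microsoft/Phi-3-mini-128k-instruct checkpoint ID (inferred from the model's name), a recent transformers release (or trust_remote_code=True on older ones), and accelerate installed for device placement. It is an illustrative example under those assumptions, not the model card's canonical snippet.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint ID assumed from the model's name; verify it against the Hub.
model_id = "microsoft/Phi-3-mini-128k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # let transformers choose fp16/bf16 where supported
    device_map="auto",       # requires `accelerate`; places weights on the GPU
    trust_remote_code=True,  # needed on releases without built-in Phi-3 support
)

# Chat-format prompt: the tokenizer's chat template renders the
# <|user|> ... <|end|> <|assistant|> structure the model was tuned on.
messages = [
    {"role": "user", "content": "Summarize why long context windows matter."},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
# Strip the prompt tokens and decode only the newly generated reply.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For CPU or cross-platform deployment, the ONNX export of the model can be served through ONNX Runtime instead of the PyTorch path shown here.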
In summary, the Phi-3-Mini-128K-Instruct is a versatile and robust model that pushes the boundaries of AI research and application, provided it’s used responsibly and within legal frameworks.
The Phi-3-Mini-128K-Instruct, part of the Phi-3 family, is a 3.8 billion-parameter model that has demonstrated strong performance across a range of benchmarks. Here's how it compares with other language models:
1) MMLU (5-Shot): Scored 68.1, close behind GPT-3.5's 71.4 despite having far fewer parameters.
2) HellaSwag (5-Shot): Achieved 74.5, competitive with much larger models such as GPT-3.5 at 78.8.
3) ANLI (7-Shot): Scored 52.8, below GPT-3.5's 58.1 but still robust adversarial-reasoning performance for a 3.8B model.
4) GSM-8K (0-Shot; CoT): Excelled with 83.6, surpassing GPT-3.5's 78.1 and indicating strong mathematical reasoning.
5) MedQA (2-Shot): Scored 55.3 on medical question answering, some distance behind GPT-3.5's 63.4.
6) AGIEval (0-Shot): Scored 36.9, well short of GPT-3.5's 48.4, with clear headroom on this general-ability evaluation.
Overall, the Phi-3-Mini-128K-Instruct stands out for its efficiency, matching or exceeding larger models such as GPT-3.5 on several benchmarks and delivering state-of-the-art performance among models with fewer than 13 billion parameters. This makes it a valuable option for both commercial and research applications.
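As a point of reference for the k-shot settings quoted above: a k-shot evaluation simply prepends k solved exemplars to the test question so the model can infer the expected answer format. The helper below is a hypothetical illustration of that prompt assembly (the Q/A layout and the build_k_shot_prompt name are invented for this sketch, not Microsoft's evaluation harness). The full comparison table follows.

```python
# Hypothetical sketch of k-shot prompt assembly; the Q/A layout and this
# helper are illustrative only, not the harness behind the scores below.
def build_k_shot_prompt(exemplars, question, k=5):
    """Prepend k solved (question, answer) pairs before the test question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in exemplars[:k]]
    parts.append(f"Q: {question}\nA:")  # the model completes this final answer
    return "\n\n".join(parts)

demo = [("2 + 2 = ?", "4"), ("What is the capital of France?", "Paris")]
print(build_k_shot_prompt(demo, "3 * 3 = ?", k=2))
```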
Benchmark | Phi-3-Mini-128K-In (3.8b) | Phi-3-Small (7b, preview) | Phi-3-Medium (14b, preview) | Phi-2 (2.7b) | Mistral (7b) | Gemma (7b) | Llama-3-In (8b) | Mixtral (8x7b) | GPT-3.5 (version 1106) |
---|---|---|---|---|---|---|---|---|---|
MMLU 5-Shot | 68.1 | 75.3 | 78.2 | 56.3 | 61.7 | 63.6 | 66.5 | 68.4 | 71.4 |
HellaSwag 5-Shot | 74.5 | 78.7 | 83.2 | 53.6 | 58.5 | 49.8 | 71.1 | 70.4 | 78.8 |
ANLI 7-Shot | 52.8 | 55 | 58.7 | 42.5 | 47.1 | 48.7 | 57.3 | 55.2 | 58.1 |
GSM-8K 0-Shot; CoT | 83.6 | 86.4 | 90.8 | 61.1 | 46.4 | 59.8 | 77.4 | 64.7 | 78.1 |
MedQA 2-Shot | 55.3 | 58.2 | 69.8 | 40.9 | 49.6 | 50 | 60.5 | 62.2 | 63.4 |
AGIEval 0-Shot | 36.9 | 45 | 49.7 | 29.8 | 35.1 | 42.1 | 42 | 45.2 | 48.4 |
TriviaQA 5-Shot | 57.1 | 59.1 | 73.3 | 45.2 | 72.3 | 75.2 | 67.7 | 82.2 | 85.8 |
Arc-C 10-Shot | 84 | 90.7 | 91.9 | 75.9 | 78.6 | 78.3 | 82.8 | 87.3 | 87.4 |
Arc-E 10-Shot | 95.2 | 97.1 | 98 | 88.5 | 90.6 | 91.4 | 93.4 | 95.6 | 96.3 |
PIQA 5-Shot | 83.6 | 87.8 | 88.2 | 60.2 | 77.7 | 78.1 | 75.7 | 86 | 86.6 |
SociQA 5-Shot | 76.1 | 79 | 79.4 | 68.3 | 74.6 | 65.5 | 73.9 | 75.9 | 68.3 |
BigBench-Hard 0-Shot | 71.5 | 75 | 82.5 | 59.4 | 57.3 | 59.6 | 51.5 | 69.7 | 68.32 |
WinoGrande 5-Shot | 72.5 | 82.5 | 81.2 | 54.7 | 54.2 | 55.6 | 65 | 62 | 68.8 |
OpenBookQA 10-Shot | 80.6 | 88.4 | 86.6 | 73.6 | 79.8 | 78.6 | 82.6 | 85.8 | 86 |
BoolQ 0-Shot | 78.7 | 82.9 | 86.5 | 72.2 | 66 | 80.9 | 77.6 | 79.1 | |
CommonSenseQA 10-Shot | 78 | 80.3 | 82.6 | 69.3 | 72.6 | 76.2 | 79 | 78.1 | 79.6 |
TruthfulQA 10-Shot | 63.2 | 68.1 | 74.8 | 52.1 | 53 | 63.2 | 60.1 | 85.8 | |
HumanEval 0-Shot | 57.9 | 59.1 | 54.7 | 47 | 28 | 34.1 | 60.4 | 37.8 | 62.2 |
MBPP 3-Shot | 62.5 | 71.4 | 73.7 | 60.6 | 50.8 | 51.5 | 67.7 | 60.2 | 77.8 |
The team behind this large language model (LLM) is from Microsoft, a verified organization with a strong presence in AI and ML research. The team, comprising 1405 members, has contributed to a wide range of projects, including state-of-the-art models and frameworks. One notable contribution is the SpeechT5 framework, which addresses multiple audio-related tasks through a unified seq2seq model complemented by modal-specific pre/post-nets. Another significant project is TAPEX, a pre-training approach for table-based question answering and fact verification that showcases their expertise in handling structured data. Their work reflects a sustained commitment to advancing machine learning, particularly natural language processing and speech synthesis, and their collection of models and datasets serves as a valuable resource for the broader AI community.