Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) language model with the same overall architecture as Mistral 7B, except that each layer contains 8 feedforward blocks (experts). For every token, a router network selects 2 of these experts at each layer, so the model has access to 47B parameters while using only 13B active parameters per token. Trained with a 32k-token context window, Mixtral outperforms or matches prominent models such as Llama 2 70B and GPT-3.5 across multiple benchmarks.
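To make the routing concrete, here is a minimal sketch of a top-2 sparse MoE feedforward layer in PyTorch. The class name, the plain SiLU experts, and the loop-based dispatch are illustrative assumptions, not Mistral's released implementation (Mixtral's experts are SwiGLU blocks and production code batches tokens per expert).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative top-2 sparse MoE feedforward layer (not Mistral's code)."""

    def __init__(self, dim: int = 4096, hidden_dim: int = 14336,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: scores each token against every expert.
        self.gate = nn.Linear(dim, num_experts, bias=False)
        # Each expert is an independent feedforward block
        # (a simple SiLU MLP here for brevity; Mixtral uses SwiGLU experts).
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden_dim, bias=False),
                          nn.SiLU(),
                          nn.Linear(hidden_dim, dim, bias=False))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, dim)
        scores = self.gate(x)                             # (tokens, num_experts)
        top_w, top_idx = scores.topk(self.top_k, dim=-1)  # keep only the top-k experts
        top_w = F.softmax(top_w, dim=-1)                  # renormalize over the chosen k
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e                 # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += top_w[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

Because only 2 of the 8 experts run per token, the compute cost per token stays close to that of a dense 13B model even though all 47B parameters must be held in memory.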
Mixtral 8x7B is particularly strong on mathematics, code generation, and multilingual tasks, where it significantly outperforms Llama 2 70B. Its instruction-tuned version, Mixtral 8x7B Instruct, surpasses GPT-3.5 Turbo, Claude-2.1, and Gemini Pro on human-evaluation benchmarks.
The table below compares Mixtral with the Llama family: Mixtral outperforms or matches Llama 2 70B on almost all popular benchmarks while using significantly fewer active parameters during inference. Note that Mixtral 8x7B has only 13B active parameters (a rough parameter-count sketch follows the table).
Benchmark | LLaMA 2 7B | LLaMA 2 13B | LLaMA 1 33B | LLaMA 2 70B | Mistral 7B | Mixtral 8x7B |
---|---|---|---|---|---|---|
MMLU | 44.40% | 55.60% | 56.80% | 69.90% | 62.50% | 70.60% |
HellaSwag | 77.10% | 80.70% | 83.70% | 85.40% | 81.00% | 84.40% |
WinoGrande | 69.50% | 72.90% | 76.20% | 80.40% | 74.20% | 77.20% |
PIQA | 77.90% | 80.80% | 82.20% | 82.60% | 82.20% | 83.60% |
Arc-e | 68.70% | 75.20% | 79.60% | 79.90% | 80.50% | 83.10% |
Arc-c | 43.20% | 48.80% | 54.40% | 56.50% | 54.90% | 59.70% |
NQ | 17.50% | 16.70% | 24.10% | 25.40% | 23.20% | 30.60% |
TriviaQA | 56.60% | 64.00% | 68.50% | 73.00% | 62.50% | 71.50% |
HumanEval | 11.60% | 18.90% | 25.00% | 29.30% | 26.20% | 40.20% |
MBPP | 26.10% | 35.40% | 40.90% | 49.80% | 50.20% | 60.70% |
Math | 3.90% | 6.00% | 8.40% | 13.80% | 12.70% | 28.40% |
GSM8K | 16.00% | 34.30% | 44.10% | 69.60% | 50.00% | 74.40% |
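As a sanity check on the 47B total / 13B active figures, here is a back-of-the-envelope parameter count. It assumes the hyperparameters reported in the Mixtral paper (model dimension 4096, 32 layers, expert hidden size 14336, 32 query heads and 8 KV heads of size 128, 32k vocabulary) and ignores norms and biases, so it lands close to but not exactly on the official numbers.

```python
# Rough parameter count for Mixtral 8x7B (hyperparameters from the paper;
# norms and biases ignored, so totals are approximate).
dim, n_layers, hidden_dim = 4096, 32, 14336
n_heads, n_kv_heads, head_dim = 32, 8, 128
vocab, n_experts, top_k = 32000, 8, 2

attn = 2 * dim * (n_heads * head_dim) + 2 * dim * (n_kv_heads * head_dim)  # Wq, Wo + Wk, Wv
expert = 3 * dim * hidden_dim                                              # SwiGLU: w1, w2, w3
router = dim * n_experts                                                   # gating layer

per_layer_total = attn + router + n_experts * expert
per_layer_active = attn + router + top_k * expert
embeddings = 2 * vocab * dim                                               # input + output embeddings

total = n_layers * per_layer_total + embeddings
active = n_layers * per_layer_active + embeddings
print(f"total  ≈ {total / 1e9:.1f}B")   # ≈ 46.7B
print(f"active ≈ {active / 1e9:.1f}B")  # ≈ 12.9B
```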
The model was developed by a Mistral AI team including Albert Q. Jiang, Alexandre Sablayrolles, and colleagues. Their continued work suggests further refinement of Mixtral, with the aim of applying its sparse parameter utilization more broadly.
The Mixtral community is growing quickly, with active engagement and contributions; the project's Hugging Face model page alone has recorded over 116K downloads.