The MMLU Benchmark (Massive Multitask Language Understanding) is an evaluation dataset for language models that measures multitask accuracy across 57 tasks, including mathematics, history, law, and computer science. It is split into a few-shot development set, a 1,540-question validation set, and a 14,079-question test set, and models are evaluated in zero-shot and few-shot settings to probe their world knowledge, problem-solving ability, and limitations.
The benchmark is widely used to test language models on tasks that demand knowledge and reasoning beyond general language understanding. For instance, a model can be tested on its ability to solve a mathematical problem or to answer questions about historical events.
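To make the few-shot setup concrete, here is a minimal sketch of how one might load a single MMLU subject and build a 5-shot multiple-choice prompt. It assumes the Hugging Face Hub copy of the dataset under the identifier "cais/mmlu", with fields named `question`, `choices`, and `answer` and splits named `dev` and `test`; verify the dataset ID and schema before relying on this.

```python
# Minimal sketch of few-shot MMLU prompting.
# Assumes the "cais/mmlu" dataset on the Hugging Face Hub with fields
# question / choices / answer and splits dev / test (an assumption to verify).
from datasets import load_dataset

CHOICE_LABELS = ["A", "B", "C", "D"]

def format_question(example, include_answer=True):
    """Render one MMLU item as a multiple-choice block."""
    lines = [example["question"]]
    for label, choice in zip(CHOICE_LABELS, example["choices"]):
        lines.append(f"{label}. {choice}")
    # The gold answer is stored as an index into the choices list.
    lines.append(f"Answer: {CHOICE_LABELS[example['answer']]}" if include_answer else "Answer:")
    return "\n".join(lines)

def build_few_shot_prompt(dev_examples, test_example, subject_name):
    """Prepend k solved development questions to an unanswered test question."""
    header = f"The following are multiple choice questions (with answers) about {subject_name}.\n\n"
    shots = "\n\n".join(format_question(ex) for ex in dev_examples)
    query = format_question(test_example, include_answer=False)
    return header + shots + "\n\n" + query

if __name__ == "__main__":
    subject = "high_school_mathematics"  # one of the 57 tasks
    dev = load_dataset("cais/mmlu", subject, split="dev")    # few-shot examples
    test = load_dataset("cais/mmlu", subject, split="test")  # held-out questions
    prompt = build_few_shot_prompt(list(dev)[:5], test[0], subject.replace("_", " "))
    print(prompt)  # send this to a model and compare its letter choice to the gold answer
```

Scoring then amounts to checking, for each test question, whether the model's chosen letter matches the gold answer, and averaging that accuracy per task and across all 57 tasks.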