MMLU

The MMLU Benchmark (Massive Multitask Language Understanding) is an evaluation dataset for language models. It covers 57 tasks spanning subjects such as mathematics, history, law, and computer science, and is split into a few-shot development set, a 1,540-question validation set, and a 14,079-question test set. Models are evaluated in zero-shot and few-shot settings, and their multitask accuracy serves as a measure of their world knowledge, problem-solving ability, and limitations.
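
As a minimal sketch of how the splits are typically used, the snippet below loads one MMLU subject and assembles a 5-shot prompt from the development set, with the question to grade drawn from the test set. It assumes the "cais/mmlu" dataset on the Hugging Face Hub, whose records expose "question", "choices", and "answer" (an index into the choices); the helper names are illustrative, not part of any official harness.

```python
# Sketch: build a 5-shot MMLU prompt from the dev split (assumes cais/mmlu fields).
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_question(example, include_answer=True):
    # Render a question with lettered choices, optionally appending the gold answer.
    lines = [example["question"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, example["choices"])]
    answer = f"Answer: {LETTERS[example['answer']]}" if include_answer else "Answer:"
    return "\n".join(lines + [answer])

def build_few_shot_prompt(subject="high_school_mathematics", k=5):
    # The "dev" split supplies the few-shot exemplars; "test" supplies the graded question.
    dev = load_dataset("cais/mmlu", subject, split="dev")
    test = load_dataset("cais/mmlu", subject, split="test")
    shots = "\n\n".join(format_question(dev[i]) for i in range(k))
    question = format_question(test[0], include_answer=False)
    header = (
        f"The following are multiple choice questions (with answers) "
        f"about {subject.replace('_', ' ')}.\n\n"
    )
    return header + shots + "\n\n" + question

print(build_few_shot_prompt())
```

Setting k=0 and dropping the exemplars gives the corresponding zero-shot prompt for the same question.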

MMLU Benchmark (Massive Multitask Language Understanding)

Areas of application

  • language model evaluation
  • natural language processing
  • computer science
  • problem-solving skills
  • world knowledge
  • few-shot learning
  • zero-shot learning

Example

The MMLU Benchmark is widely used to evaluate how well language models handle tasks that demand knowledge beyond basic language understanding, such as math, history, law, and computer science. For instance, a model can be asked to solve a mathematical problem or answer a question about a historical event, and its multiple-choice accuracy across the 57 tasks indicates how broad and reliable its knowledge is.
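
The following sketch shows what such an evaluation loop might look like for a single subject: each test question is formatted as a multiple-choice prompt, the model's predicted letter is compared against the gold answer index, and accuracy is reported. It again assumes the "cais/mmlu" dataset fields, and ask_model is a hypothetical placeholder for whatever LLM call is being evaluated.

```python
# Sketch: score a model on one MMLU subject (ask_model is a hypothetical stand-in).
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def ask_model(prompt: str) -> str:
    # Placeholder: return the model's answer letter ("A".."D") for the prompt.
    raise NotImplementedError

def evaluate_subject(subject="high_school_us_history", limit=100):
    test = load_dataset("cais/mmlu", subject, split="test")
    total = min(limit, len(test))
    correct = 0
    for example in test.select(range(total)):
        # Present the question with lettered choices and an "Answer:" cue.
        prompt = "\n".join(
            [example["question"]]
            + [f"{l}. {c}" for l, c in zip(LETTERS, example["choices"])]
            + ["Answer:"]
        )
        prediction = ask_model(prompt).strip().upper()[:1]
        correct += prediction == LETTERS[example["answer"]]
    return correct / total
```

Averaging this per-subject accuracy over all 57 tasks yields the overall MMLU score commonly reported for a model.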