GPQA

GPQA, which stands for Graduate-Level Google-Proof Q&A Benchmark, is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms. It consists of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy, while highly skilled non-expert validators reach only 34% despite unrestricted web access. The questions are also difficult for state-of-the-art AI systems, with the strongest GPT-4-based baseline achieving 39% accuracy. This difficulty for both skilled non-experts and frontier AI systems enables realistic scalable oversight experiments, which can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

Areas of application

  • Evaluating Large Language Models (LLMs): GPQA's 448 expert-written multiple-choice questions in biology, physics, and chemistry provide a hard test of frontier model capabilities (see the evaluation sketch after this list).
  • Testing Deep Understanding and Reasoning: Because the questions cannot be answered by simple lookup, the benchmark probes genuine domain understanding and multi-step reasoning rather than retrieval.
  • Scalable Oversight: GPQA supports research on oversight methods that let humans supervise the outputs of AI systems whose knowledge exceeds their own.
  • Realistic Scalable Oversight Experiments: Because the questions are difficult for both skilled non-experts and frontier AI systems, GPQA enables realistic scalable oversight experiments.
  • Reliable Information Extraction: Such experiments can help devise protocols by which human experts reliably obtain truthful information from AI systems that surpass human capabilities.
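
A minimal sketch of the evaluation use case above: scoring a model on GPQA-style four-option questions and reporting accuracy. The item fields ("question", "correct", "incorrect") and the model_answer callback are hypothetical placeholders rather than official GPQA tooling; the authors' own baselines are in the GitHub repository listed under Resources.

    import random

    # Minimal sketch of a GPQA-style evaluation loop. The item layout and the
    # model_answer() callback are illustrative placeholders, not official tooling.

    LETTERS = ["A", "B", "C", "D"]

    def format_prompt(question, options):
        """Render a four-option multiple-choice question as a single prompt."""
        lines = [question]
        lines += [f"{letter}) {option}" for letter, option in zip(LETTERS, options)]
        lines.append("Answer with a single letter: A, B, C, or D.")
        return "\n".join(lines)

    def evaluate(items, model_answer):
        """Compute accuracy of model_answer over a list of GPQA-style items.

        Each item is assumed to be a dict with keys "question", "correct",
        and "incorrect" (a list of three distractors); model_answer(prompt)
        is any callable that returns the model's chosen letter.
        """
        num_correct = 0
        for item in items:
            options = [item["correct"]] + list(item["incorrect"])
            random.shuffle(options)  # randomize answer order to avoid positional bias
            gold = LETTERS[options.index(item["correct"])]
            prediction = model_answer(format_prompt(item["question"], options))
            num_correct += prediction.strip().upper().startswith(gold)
        return num_correct / len(items)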

Example

GPQA questions are designed to be “Google-proof”: highly skilled non-expert validators reach only 34% accuracy even after spending, on average, more than 30 minutes per question with unrestricted web access. The dataset itself is publicly released (see the Resources below), although the authors ask that questions and answers not be reposted in plain text online, to limit contamination of future training data. Each of the 448 items pairs an expert-written question with one correct answer and three expert-written incorrect answers, and even experts with or pursuing PhDs in the relevant domains find them challenging.
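
For a sense of the record structure, the snippet below sketches loading the released data with the Hugging Face datasets library and inspecting one item. The Hub dataset name (Idavidrein/gpqa), config name, and column names are assumptions about the public release (access may also require accepting the dataset's terms on the Hub); the official GitHub repository remains the authoritative source for the data format.

    # Sketch of loading GPQA and inspecting one record. Dataset, config, and
    # column names below are assumptions; check the official release if they differ.
    from datasets import load_dataset

    gpqa = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")
    print(len(gpqa))                     # number of questions in the main set

    record = gpqa[0]
    print(record["Question"])            # question text
    print(record["Correct Answer"])      # the single correct option
    print(record["Incorrect Answer 1"])  # one of three expert-written distractors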

Resources

  • GitHub Repository: The official GitHub repository for GPQA, containing the dataset, baselines, and analysis code.
  • arXiv Paper: The paper introducing the benchmark, “GPQA: A Graduate-Level Google-Proof Q&A Benchmark”.
  • Papers with Code: A brief overview of the GPQA benchmark and results reported on it.