GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms. It consists of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. The questions are high-quality and extremely difficult: experts who hold or are pursuing PhDs in the corresponding domains reach only 65% accuracy, while highly skilled non-expert validators reach just 34% despite spending, on average, over 30 minutes per question with unrestricted access to the web. The questions are also difficult for state-of-the-art AI systems, with the strongest GPT-4-based baseline achieving 39% accuracy. This difficulty for both skilled non-experts and frontier AI systems enables realistic scalable oversight experiments, which can help devise ways for human experts to reliably obtain truthful information from AI systems that surpass human capabilities.
The GPQA questions are publicly released (via the authors' GitHub repository and Hugging Face), though the authors ask that examples not be reposted online in plain text, to reduce the risk of the questions leaking into model training data. "Google-proof" does not mean the questions are hidden; it means they cannot be answered with a simple web search: skilled non-expert validators fail to answer them reliably even with unrestricted internet access. This ensures the benchmark is a true test of an AI system's ability to understand and reason in biology, physics, and chemistry, rather than its ability to retrieve answers.
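For a concrete sense of the data, here is a minimal sketch of loading GPQA and turning one record into a four-option prompt. The Hugging Face dataset ID (`Idavidrein/gpqa`, config `gpqa_main`) and the column names in the comments are assumptions based on the public release; the dataset is gated, so access may require accepting its terms and authenticating first.

```python
# Sketch: load GPQA and build a shuffled four-way multiple-choice prompt.
# Dataset ID, config name, and column names below are assumptions.
import random
from datasets import load_dataset

dataset = load_dataset("Idavidrein/gpqa", "gpqa_main", split="train")

def build_prompt(example, rng):
    """Mix the correct answer in with the three expert-written distractors."""
    choices = [
        example["Correct Answer"],       # assumed column name
        example["Incorrect Answer 1"],   # assumed column name
        example["Incorrect Answer 2"],
        example["Incorrect Answer 3"],
    ]
    rng.shuffle(choices)
    letters = "ABCD"
    gold_letter = letters[choices.index(example["Correct Answer"])]
    lines = [example["Question"], ""]
    lines += [f"({letters[i]}) {c}" for i, c in enumerate(choices)]
    lines.append("Answer with a single letter (A-D).")
    return "\n".join(lines), gold_letter

rng = random.Random(0)
prompt, gold = build_prompt(dataset[0], rng)
print(prompt)
print("Gold:", gold)
```

Scoring an LLM on the benchmark then amounts to sending each prompt to the model, parsing out the letter it chooses, and reporting the fraction that matches the gold letter.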