Glossary

DROP (F1)

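DROP (Discrete Reasoning Over Paragraphs) is a reading-comprehension benchmark that tests AI systems’ ability to perform discrete reasoning, such as addition, counting, and sorting, over paragraphs. It contains 96k crowdsourced questions, and scores are reported as F1.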

MGSM

MGSM is a benchmark of 250 grade-school math problems, translated into 10 languages. It tests large language models’ multilingual reasoning abilities.

GPQA

GPQA is a benchmark for large language models consisting of 448 expert-written multiple-choice questions in biology, physics, and chemistry. It is designed for scalable oversight experiments and for testing AI capabilities.
