Glossary

DROP (F1)

Posted by Fede Nolasco | May 18, 2024

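DROP (Discrete Reasoning Over Paragraphs) is a reading-comprehension benchmark that tests AI systems’ ability to perform discrete operations, such as addition, counting, or sorting, over the content of paragraphs. It features 96k crowdsourced questions.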

MGSM

Posted by Fede Nolasco | May 18, 2024

MGSM (Multilingual Grade School Math) is a benchmark of 250 grade-school math problems drawn from GSM8K, each translated into 10 languages. It tests large language models’ multilingual reasoning abilities.

GPQA

Posted by Fede Nolasco | May 18, 2024

GPQA is a benchmark for large language models consisting of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. It’s designed for scalable oversight experiments and for testing AI capabilities.