DROP (F1)
Posted by Fede Nolasco | May 18, 2024
DROP is a benchmark that tests AI systems’ ability to perform discrete reasoning over paragraphs. It features 96k crowdsourced questions.
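The "(F1)" in the title refers to how DROP is scored: predicted answers are compared against gold answers with a bag-of-tokens F1 rather than exact match alone. The sketch below illustrates that token-overlap F1 in Python; it is a simplification, since the official DROP evaluator also normalizes punctuation and numbers, handles multi-span answers, and takes the maximum score over all gold answers.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted and a gold answer string.

    Simplified illustration of DROP-style scoring; the official metric
    adds answer normalization and a max over multiple gold answers.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    # Count tokens shared between prediction and gold answer.
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: partial credit when the prediction contains extra tokens.
print(token_f1("about 25 percent", "25 percent"))  # 0.8
```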
MGSM
Posted by Fede Nolasco | May 18, 2024
MGSM is a benchmark of 250 grade-school math problems, translated into 10 languages. It tests large language models’ multilingual reasoning abilities.
GPQA
Posted by Fede Nolasco | May 18, 2024
GPQA is a benchmark for large language models consisting of 448 expert-crafted questions in biology, physics, and chemistry. It is designed for scalable oversight experiments and for testing AI capabilities.