LLM Benchmarks

ARC-e

ARC-e (ARC-Easy) is a subset of the AI2 Reasoning Challenge (ARC) benchmark, used to evaluate large language models' reasoning abilities. It contains 1,169 questions, and no model has yet reached a 75% score.
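
For illustration, here is a minimal sketch of how a model might be scored on ARC-e, assuming the Hugging Face `datasets` library and its `allenai/ai2_arc` dataset; `model_answer` is a hypothetical stand-in for the model under test, not part of any benchmark API.

```python
from datasets import load_dataset

def model_answer(question: str, choice_texts: list[str], choice_labels: list[str]) -> str:
    """Hypothetical stand-in for the model being evaluated; returns an
    answer label such as 'A'."""
    return choice_labels[0]  # placeholder: always picks the first choice

# Each example has a question, multiple answer choices, and a gold answerKey.
arc_easy = load_dataset("allenai/ai2_arc", "ARC-Easy", split="test")

correct = 0
for ex in arc_easy:
    pred = model_answer(ex["question"], ex["choices"]["text"], ex["choices"]["label"])
    if pred == ex["answerKey"]:
        correct += 1

print(f"ARC-e accuracy: {correct / len(arc_easy):.1%}")
```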


MBPP

MBPP (Mostly Basic Python Problems) is a dataset of Python programming problems for evaluating code-generation models on basic programming skills.
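
Below is a minimal sketch of how a single MBPP task might be scored, assuming the Hugging Face `datasets` library and its `mbpp` dataset; `generate_code` is a hypothetical stand-in for a code-generation model (here it simply echoes the reference solution). Executing model-generated code this way is unsafe outside a sandbox.

```python
from datasets import load_dataset

def generate_code(prompt: str, reference: str) -> str:
    """Hypothetical stand-in for a code-generation model; this sketch
    simply echoes the dataset's reference solution."""
    return reference

# Each task has a natural-language description, a reference solution,
# and a list of `assert` statements that a candidate solution must pass.
mbpp = load_dataset("mbpp", split="test")
task = mbpp[0]

candidate = generate_code(task["text"], task["code"])
namespace: dict = {}
try:
    exec(candidate, namespace)              # define the candidate function(s)
    for test in task["test_list"]:          # each entry is an `assert ...` statement
        exec(test, namespace)
    print(f"task {task['task_id']}: pass")
except Exception as exc:
    print(f"task {task['task_id']}: fail ({exc})")
```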
