LLM Benchmarks

ARC-e

ARC-e (ARC-Easy) is a subset of the AI2 Reasoning Challenge (ARC) benchmark, used to evaluate large language models' reasoning abilities. It contains 1,169 questions, and no model has yet reached a 75% score.
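
For illustration, here is a minimal sketch of how a model might be scored on ARC-e, assuming the Hugging Face `datasets` library and its `allenai/ai2_arc` dataset; `model_answer` is a hypothetical stand-in for the model under test, not part of any benchmark API.

```python
from datasets import load_dataset

def model_answer(question: str, choice_texts: list[str], choice_labels: list[str]) -> str:
    """Hypothetical stand-in for the model being evaluated; returns an
    answer label such as 'A'."""
    return choice_labels[0]  # placeholder: always picks the first choice

# Each example has a question, multiple answer choices, and a gold answerKey.
arc_easy = load_dataset("allenai/ai2_arc", "ARC-Easy", split="test")

correct = 0
for ex in arc_easy:
    pred = model_answer(ex["question"], ex["choices"]["text"], ex["choices"]["label"])
    if pred == ex["answerKey"]:
        correct += 1

print(f"ARC-e accuracy: {correct / len(arc_easy):.1%}")
```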


MBPP

MBPP (Mostly Basic Python Problems) is a dataset of Python programming problems for evaluating code-generation models on basic programming skills.
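
Below is a minimal sketch of how a single MBPP task might be scored, assuming the Hugging Face `datasets` library and its `mbpp` dataset; `generate_code` is a hypothetical stand-in for a code-generation model (here it simply echoes the reference solution). Executing model-generated code this way is unsafe outside a sandbox.

```python
from datasets import load_dataset

def generate_code(prompt: str, reference: str) -> str:
    """Hypothetical stand-in for a code-generation model; this sketch
    simply echoes the dataset's reference solution."""
    return reference

# Each task has a natural-language description, a reference solution,
# and a list of `assert` statements that a candidate solution must pass.
mbpp = load_dataset("mbpp", split="test")
task = mbpp[0]

candidate = generate_code(task["text"], task["code"])
namespace: dict = {}
try:
    exec(candidate, namespace)              # define the candidate function(s)
    for test in task["test_list"]:          # each entry is an `assert ...` statement
        exec(test, namespace)
    print(f"task {task['task_id']}: pass")
except Exception as exc:
    print(f"task {task['task_id']}: fail ({exc})")
```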
