LLM benchmark

ARC-c

ARC-c (ARC-Challenge) is the harder split of the AI2 Reasoning Challenge (ARC), a benchmark of grade-school science multiple-choice questions used to assess the reasoning ability of large language models. The Challenge split contains only the questions that simple retrieval-based and word co-occurrence baselines answered incorrectly, which makes it substantially more difficult for models than the companion ARC-Easy split.
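As a minimal sketch of how the split is typically accessed, the example below loads ARC-Challenge with the Hugging Face `datasets` library; the `allenai/ai2_arc` dataset id, the `ARC-Challenge` configuration name, and the field names shown are assumptions based on the commonly published schema.

```python
# Minimal sketch: load the ARC-Challenge split via Hugging Face datasets.
# Assumes the "allenai/ai2_arc" dataset id and its "ARC-Challenge" config;
# the fields (question, choices, answerKey) follow the commonly used schema.
from datasets import load_dataset

arc_c = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

example = arc_c[0]
print(example["question"])                   # question stem
for label, text in zip(example["choices"]["label"], example["choices"]["text"]):
    print(f"  {label}. {text}")              # multiple-choice options
print("answer:", example["answerKey"])       # gold label, e.g. "A"-"D"
```

In a typical evaluation, a model is prompted with the question stem and the labeled options, and its accuracy is the fraction of questions for which it selects the gold `answerKey`.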
