ARC-c (ARC-Challenge) is the harder subset of the AI2 Reasoning Challenge (ARC) Benchmark, designed to assess the reasoning abilities and science knowledge of large language models. Its 2,590 grade-school science questions were answered incorrectly by both a retrieval-based and a word co-occurrence baseline, so they cannot be solved by surface-level matching alone. Learn more about this dataset and the difficulty it presents for models.
ARC-e (ARC-Easy) is the easier subset of the ARC Benchmark, comprising the remaining 5,197 grade-school science questions. It evaluates the same multiple-choice reasoning abilities of large language models, but models typically score markedly higher on it than on ARC-c.
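As a minimal sketch, both subsets can be loaded with the Hugging Face `datasets` library from the `allenai/ai2_arc` dataset; the configuration and field names below follow the public dataset card and are assumptions rather than part of this catalog.

```python
# Sketch: loading ARC-Easy and ARC-Challenge via the Hugging Face `datasets` library.
# Dataset name, config names, and field layout are taken from the public dataset card
# ("allenai/ai2_arc") and may differ in other mirrors.
from datasets import load_dataset

arc_easy = load_dataset("allenai/ai2_arc", "ARC-Easy", split="test")
arc_challenge = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

# Each example is a multiple-choice science question with a gold answer key.
example = arc_challenge[0]
print(example["question"])          # question stem
print(example["choices"]["label"])  # e.g. ["A", "B", "C", "D"]
print(example["choices"]["text"])   # answer option texts
print(example["answerKey"])         # gold label, e.g. "B"
```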