Glossary

DROP (F1)

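DROP (Discrete Reasoning Over Paragraphs) is a reading-comprehension benchmark that tests AI systems’ ability to perform discrete reasoning over paragraphs, such as addition, counting, and sorting. It contains about 96k crowdsourced questions, and results are commonly reported as an F1 score.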
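As a rough illustration of what an F1 score measures on a question-answering benchmark, the sketch below computes a bag-of-words F1 between a predicted answer and a reference answer. It is a simplified stand-in, not the official DROP scorer, which applies additional answer normalization and handles numbers and multi-span answers. The function name `token_f1` and the lowercase, whitespace-split tokenization are assumptions made for this example.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-words F1 between a predicted and a reference answer.

    Simplified illustration only: tokens are lowercased and split on
    whitespace, with no further normalization.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of predicted tokens that are correct
    recall = overlap / len(ref_tokens)      # share of reference tokens that were found
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(token_f1("four touchdowns", "four touchdowns"))
    print(token_f1("four", "four touchdowns"))
```

A partially correct answer such as "four" against "four touchdowns" receives partial credit, which is why F1 is usually reported alongside exact match.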

MGSM

MGSM (Multilingual Grade School Math) is a benchmark of 250 grade-school math problems from GSM8K, each manually translated into 10 languages. It tests large language models’ multilingual reasoning abilities.

GPQA

GPQA, a benchmark for large language models, consists of 448 expert-written multiple-choice questions in biology, physics, and chemistry. It is designed for scalable oversight experiments and for testing AI capabilities.

LLM size

Model size refers to the number of parameters in a large language model. Large language models typically have billions of parameters, which enable them to understand and generate natural-language text.
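To give a sense of scale, the sketch below converts a parameter count into the approximate memory needed just to store the weights. The 7-billion-parameter figure is a hypothetical example, and the function name `weight_memory_gb` and the 2-bytes-per-parameter default (16-bit weights) are assumptions. Real memory use is higher once activations and other runtime state are included.

```python
def weight_memory_gb(num_parameters: int, bytes_per_parameter: int = 2) -> float:
    """Approximate memory (in gigabytes) needed to hold the model weights.

    bytes_per_parameter is an assumption: 2 for 16-bit weights,
    4 for 32-bit weights.
    """
    return num_parameters * bytes_per_parameter / 1e9


if __name__ == "__main__":
    hypothetical_params = 7_000_000_000  # a hypothetical 7-billion-parameter model
    print(weight_memory_gb(hypothetical_params))     # 16-bit weights
    print(weight_memory_gb(hypothetical_params, 4))  # 32-bit weights
```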