The Multilingual Grade School Math Benchmark (MGSM) is a dataset designed to evaluate the reasoning abilities of large language models in multilingual settings. It consists of 250 grade-school math problems drawn from the GSM8K dataset, each translated by human annotators into 10 typologically diverse languages. The problems are high-quality and challenging, testing a model’s ability to carry out the multi-step reasoning that grade-school word problems require, in each language.
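As a minimal sketch of what the data looks like in practice, the snippet below loads one language split with the Hugging Face `datasets` library. The repository name `juletxara/mgsm`, the config name `"es"`, and the field names are assumptions about a community mirror, not an official distribution.

```python
# A minimal sketch, assuming the Hugging Face Hub mirror
# "juletxara/mgsm" with per-language configs ("es", "de", ...)
# and the field names shown below (all assumptions).
from datasets import load_dataset

# Each language config holds the same 250 translated test problems.
mgsm_es = load_dataset("juletxara/mgsm", "es", split="test")

print(len(mgsm_es))                 # expected: 250 problems
print(mgsm_es[0]["question"])       # the translated problem statement
print(mgsm_es[0]["answer_number"])  # the gold numeric answer
```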
The MGSM benchmark is primarily used to measure how well a model’s mathematical reasoning transfers across languages, rather than in English alone.
The MGSM benchmark is evaluated by presenting the translated math problems to a language model, typically with chain-of-thought prompting to elicit intermediate reasoning steps, and comparing the model’s final answer against the reference solution. Performance is then reported as accuracy, i.e. the fraction of problems whose final answer is correct.
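A minimal sketch of this evaluation loop is shown below. The `generate` callable is a hypothetical wrapper around whatever model is under test, and the zero-shot chain-of-thought prompt and answer-extraction regex are illustrative choices, not the benchmark’s exact protocol.

```python
import re

def extract_final_number(text: str):
    """Pull the last number out of a chain-of-thought completion."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def evaluate(problems, generate):
    """Score exact-match accuracy of final answers.

    `problems` is a list of (question, gold_answer) pairs and
    `generate` is a hypothetical model-call wrapper returning text.
    """
    correct = 0
    for question, gold in problems:
        # A simple zero-shot chain-of-thought prompt (illustrative).
        prompt = f"{question}\nLet's think step by step."
        prediction = extract_final_number(generate(prompt))
        correct += prediction is not None and prediction == float(gold)
    return correct / len(problems)
```

In this setup the only scored token sequence is the final number, so verbose reasoning is free; this is why chain-of-thought prompting can raise measured accuracy without any change to the scoring rule.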