The Multilingual Grade School Math Benchmark (MGSM) is a dataset designed to evaluate the reasoning abilities of large language models in multilingual settings. It consists of 250 grade-school math problems drawn from the GSM8K dataset, each translated by human annotators into 10 typologically diverse languages. The problems are high-quality and challenging, testing a model’s ability to carry out the multi-step reasoning that grade-school word problems require, in each language.
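As a minimal sketch of what the data looks like in practice, the snippet below loads one language split with the Hugging Face `datasets` library. The repository name `juletxara/mgsm`, the config name `"es"`, and the field names are assumptions about a community mirror, not an official distribution.

```python
# A minimal sketch, assuming the Hugging Face Hub mirror
# "juletxara/mgsm" with per-language configs ("es", "de", ...)
# and the field names shown below (all assumptions).
from datasets import load_dataset

# Each language config holds the same 250 translated test problems.
mgsm_es = load_dataset("juletxara/mgsm", "es", split="test")

print(len(mgsm_es))                 # expected: 250 problems
print(mgsm_es[0]["question"])       # the translated problem statement
print(mgsm_es[0]["answer_number"])  # the gold numeric answer
```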
The MGSM benchmark is primarily used to measure how well a model’s mathematical reasoning transfers across languages, rather than in English alone.
The MGSM benchmark is evaluated by presenting the translated math problems to a language model, typically with chain-of-thought prompting to elicit intermediate reasoning steps, and comparing the model’s final answer against the reference solution. Performance is then reported as accuracy, i.e. the fraction of problems whose final answer is correct.
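A minimal sketch of this evaluation loop is shown below. The `generate` callable is a hypothetical wrapper around whatever model is under test, and the zero-shot chain-of-thought prompt and answer-extraction regex are illustrative choices, not the benchmark’s exact protocol.

```python
import re

def extract_final_number(text: str):
    """Pull the last number out of a chain-of-thought completion."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None

def evaluate(problems, generate):
    """Score exact-match accuracy of final answers.

    `problems` is a list of (question, gold_answer) pairs and
    `generate` is a hypothetical model-call wrapper returning text.
    """
    correct = 0
    for question, gold in problems:
        # A simple zero-shot chain-of-thought prompt (illustrative).
        prompt = f"{question}\nLet's think step by step."
        prediction = extract_final_number(generate(prompt))
        correct += prediction is not None and prediction == float(gold)
    return correct / len(problems)
```

In this setup the only scored token sequence is the final number, so verbose reasoning is free; this is why chain-of-thought prompting can raise measured accuracy without any change to the scoring rule.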