HumanEval is a dataset designed to evaluate the code generation capabilities of large language models (LLMs).
The benchmark consists of 164 hand-crafted programming problems, each including a function signature, docstring, body, and several unit tests.
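
As a quick illustration, here is a minimal sketch of inspecting one problem via the Hugging Face `datasets` library, assuming the `openai_humaneval` dataset ID and its published field names (`task_id`, `prompt`, `canonical_solution`, `test`, `entry_point`):

```python
# Minimal sketch: load HumanEval and look at a single problem.
# Assumes the Hugging Face dataset ID "openai_humaneval" and its standard fields.
from datasets import load_dataset

dataset = load_dataset("openai_humaneval", split="test")

problem = dataset[0]
print(problem["task_id"])             # e.g. "HumanEval/0"
print(problem["prompt"])              # function signature + docstring shown to the model
print(problem["canonical_solution"])  # reference function body
print(problem["entry_point"])         # name of the function under test
print(problem["test"])                # unit tests used to verify a generated solution
```

In a typical evaluation loop, a model is prompted with `prompt`, its completion is appended to form a full function, and the code in `test` is executed against that function (in a sandbox) to decide pass or fail.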