HellaSwag

HellaSwag is an acronym for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations. It is a dataset of about 70,000 multiple-choice questions about grounded situations, in which the model must choose the correct ending for an incomplete narrative. The incorrect endings are adversarially generated and human-verified, so they are designed to fool machines but not humans.


Areas of application

  • HellaSwag is a challenging benchmark for evaluating commonsense natural language inference (NLI) in large language models (LLMs). It tests the model’s ability to complete sentences in a way that makes sense, based on implicit knowledge about the world and human behavior.
  • HellaSwag is also intended to push the field beyond static benchmarks toward evolving ones, where the dataset co-evolves adversarially with the state of the art, presenting ever-harder challenges.

Example

Here is an example question from the HellaSwag dataset:

  • He is in the middle of a field, playing the bagpipes. He is wearing a kilt and a hat. He stops playing and
    • A) starts dancing with a partner.
    • B) takes off his hat and bows.
    • C) runs away from a swarm of bees.
    • D) throws the bagpipes in the air.

The correct answer is C, as it is the most plausible continuation of the scenario; the other endings are adversarial distractors that may look superficially plausible but fit the situation less well.
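To make the task concrete, here is a minimal sketch of how a HellaSwag-style item is scored. The field names (`ctx`, `endings`, `label`) are an assumption mirroring the common Hugging Face release of the dataset, and the scorer is a deliberately naive word-overlap heuristic standing in for a real language model, which would instead rank endings by (length-normalized) log-likelihood.

```python
def score_ending(context: str, ending: str) -> float:
    """Toy plausibility score: fraction of ending words already seen in the
    context. A real evaluator would use an LM's log-likelihood here."""
    ctx_words = set(context.lower().split())
    end_words = ending.lower().split()
    if not end_words:
        return 0.0
    return sum(w in ctx_words for w in end_words) / len(end_words)


def predict(item: dict) -> int:
    """Return the index of the highest-scoring ending."""
    scores = [score_ending(item["ctx"], e) for e in item["endings"]]
    return max(range(len(scores)), key=scores.__getitem__)


# The bagpipes example from above, laid out as one dataset item.
item = {
    "ctx": "He is in the middle of a field, playing the bagpipes. "
           "He is wearing a kilt and a hat. He stops playing and",
    "endings": [
        "starts dancing with a partner.",
        "takes off his hat and bows.",
        "runs away from a swarm of bees.",
        "throws the bagpipes in the air.",
    ],
    "label": 2,  # index of the ending marked correct (C)
}

pred = predict(item)
print(pred, pred == item["label"])
```

Run on this item, the overlap heuristic picks ending D (index 3), not the labeled answer: shallow surface cues are exactly what the adversarial distractors exploit, which is why HellaSwag is easy for humans but hard for weak scorers.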