HellaSwag

HellaSwag is an acronym for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations. It is a dataset of about 70,000 multiple-choice questions about grounded situations, in which the model must choose the correct ending for an incomplete narrative. The incorrect endings are adversarially generated and human-verified, so they are designed to fool machines but not humans.


Areas of application

  • HellaSwag is a challenging benchmark for evaluating commonsense natural language inference (NLI) in large language models (LLMs). It tests the model’s ability to complete sentences in a way that makes sense, based on implicit knowledge about the world and human behavior.
  • HellaSwag is also intended to push the field beyond static benchmarks toward evolving ones, where the dataset co-evolves adversarially with the state of the art, presenting ever-harder challenges.

Example

Here is an example question from the HellaSwag dataset:

  • He is in the middle of a field, playing the bagpipes. He is wearing a kilt and a hat. He stops playing and
    • A) starts dancing with a partner.
    • B) takes off his hat and bows.
    • C) runs away from a swarm of bees.
    • D) throws the bagpipes in the air.

The correct answer is C, as it is the most plausible continuation of the scenario; the other endings are adversarial distractors that may look superficially plausible but fit the situation less well.
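To make the task concrete, here is a minimal sketch of how a HellaSwag-style item is scored. The field names (`ctx`, `endings`, `label`) are an assumption mirroring the common Hugging Face release of the dataset, and the scorer is a deliberately naive word-overlap heuristic standing in for a real language model, which would instead rank endings by (length-normalized) log-likelihood.

```python
def score_ending(context: str, ending: str) -> float:
    """Toy plausibility score: fraction of ending words already seen in the
    context. A real evaluator would use an LM's log-likelihood here."""
    ctx_words = set(context.lower().split())
    end_words = ending.lower().split()
    if not end_words:
        return 0.0
    return sum(w in ctx_words for w in end_words) / len(end_words)


def predict(item: dict) -> int:
    """Return the index of the highest-scoring ending."""
    scores = [score_ending(item["ctx"], e) for e in item["endings"]]
    return max(range(len(scores)), key=scores.__getitem__)


# The bagpipes example from above, laid out as one dataset item.
item = {
    "ctx": "He is in the middle of a field, playing the bagpipes. "
           "He is wearing a kilt and a hat. He stops playing and",
    "endings": [
        "starts dancing with a partner.",
        "takes off his hat and bows.",
        "runs away from a swarm of bees.",
        "throws the bagpipes in the air.",
    ],
    "label": 2,  # index of the ending marked correct (C)
}

pred = predict(item)
print(pred, pred == item["label"])
```

Run on this item, the overlap heuristic picks ending D (index 3), not the labeled answer: shallow surface cues are exactly what the adversarial distractors exploit, which is why HellaSwag is easy for humans but hard for weak scorers.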