The DROP (Discrete Reasoning Over Paragraphs) benchmark is a crowdsourced, adversarially created reading-comprehension benchmark of roughly 96k questions. It requires a system to resolve references in a question, possibly to multiple positions in the input paragraph, and to perform discrete operations over them (such as addition, counting, or sorting). These operations demand a far more comprehensive understanding of paragraph content than prior datasets required.
The DROP benchmark is primarily used to evaluate the reading-comprehension and numerical-reasoning abilities of natural-language processing systems, including large language models.
Systems are evaluated on DROP by answering its questions directly: each question requires resolving references in the paragraph and performing the relevant discrete operations, and the predicted answer may be a number, a date, or one or more text spans. Performance is typically reported with exact-match and F1 scores computed against the gold answers.
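To make the scoring concrete, below is a simplified sketch of exact-match and token-level F1 scoring in the style used for DROP-like benchmarks. It is not the official DROP evaluation script (which additionally handles multi-span answers and number/date normalization); the normalization here is a minimal assumption for illustration.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace.

    A simplified stand-in for the normalization DROP-style scorers
    apply before comparing predicted and gold answers.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, the prediction "six field goals" against the gold answer "field goals" gets an exact-match score of 0 but a nonzero F1, reflecting partial token overlap.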