Glossary

HumanEval

The HumanEval dataset and its pass@k metric evaluate code generation by functional correctness: a model's generated programs are run against unit tests rather than compared to reference solutions as text. A widely used benchmark for code-generating LLMs.

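pass@k estimates the probability that at least one of k sampled completions for a problem passes its unit tests. A minimal sketch of the unbiased per-problem estimator described in the HumanEval paper, given n generated samples of which c are correct:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every possible k-subset contains at least one correct sample
    # 1 - C(n - c, k) / C(n, k), expanded as a product for numerical stability
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 5 of 20 samples pass -> pass@1 = 1 - 15/20 = 0.25
print(pass_at_k(n=20, c=5, k=1))
```

The benchmark score is then the average of this quantity over all problems in the dataset.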
TriviaQA

A large-scale dataset that pairs trivia questions with evidence documents from Wikipedia and the web, used to evaluate question-answering models' knowledge retrieval across a wide range of topics.

NQ

Google’s Natural Questions benchmark assesses a model’s ability to comprehend and answer real questions issued to Google Search, with answers annotated from Wikipedia pages.
