Evaluation
Benchmark
A standardized test or dataset for measuring and comparing the capabilities of AI models, for example MMLU, GPQA, SWE-bench, HumanEval, and MMMU.
A benchmark is a standardized test or dataset used to measure AI models' capabilities on a given task and compare them objectively. Each benchmark fixes a set of questions or problems and a scoring method (usually the percentage of correct answers), so results from different models are directly comparable.
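To make the "fixed questions plus fixed scoring method" idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the four-question `BENCHMARK`, the `normalize` and `score` helpers, and the two stand-in "models" (plain functions, not real AI systems). Real benchmarks use far larger question sets and often more elaborate answer matching.

```python
from typing import Callable, List, Tuple

# Hypothetical toy benchmark: a fixed list of (question, reference answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2 = ?", "4"),
    ("Capital of France?", "Paris"),
    ("Chemical formula for water?", "H2O"),
    ("Largest planet in the Solar System?", "Jupiter"),
]


def normalize(text: str) -> str:
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()


def score(model: Callable[[str], str],
          benchmark: List[Tuple[str, str]] = BENCHMARK) -> float:
    """Return the percentage of questions the model answers correctly."""
    correct = sum(
        normalize(model(question)) == normalize(reference)
        for question, reference in benchmark
    )
    return 100.0 * correct / len(benchmark)


# Two stand-in "models" (assumptions for this sketch, not real systems).
_REFERENCE = dict(BENCHMARK)


def model_a(question: str) -> str:
    # Answers every question correctly.
    return _REFERENCE[question]


def model_b(question: str) -> str:
    # Knows only the first two answers.
    known = {"2 + 2 = ?": "4", "Capital of France?": " paris "}
    return known.get(question, "unknown")


score_a = score(model_a)
score_b = score(model_b)
print(f"model_a: {score_a:.1f}%  model_b: {score_b:.1f}%")
```

Because both models face the same questions and the same scoring rule, the two percentages can be compared directly, which is the property that makes a benchmark useful.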
Well-known benchmarks cover different skills: MMLU tests academic knowledge across 57 subjects, GPQA poses graduate-level science questions written by domain experts, SWE-bench asks models to resolve real GitHub issues, HumanEval measures program synthesis, and MMMU evaluates multimodal image-and-text reasoning. Scores are typically reported in the system cards or technical reports that accompany new frontier models.
Benchmarks are central to AI evaluation in 2025–2026, but they have limits. Older tests like MMLU are now largely saturated (leading models score around 90% or higher, leaving little room to tell them apart), and there is a risk of contamination, in which test questions leak into training data and inflate scores. A high score does not guarantee real-world reliability or freedom from hallucination, so the field keeps building harder, more realistic tests, especially for reasoning models.
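One simple way to look for the contamination described above is to check how many word n-grams of a test question also appear verbatim in a training document. The sketch below illustrates that idea only; the function names, the choice of n = 5, and the sample texts are assumptions made for this example, and it is not the procedure used by any particular lab or benchmark.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase, whitespace-delimited word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, document: str, n: int = 5) -> float:
    """Fraction of the question's n-grams that also occur in the document."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0  # question shorter than n words: nothing to compare
    shared = question_grams & ngrams(document, n)
    return len(shared) / len(question_grams)


# Hypothetical test question and training documents.
question = "which planet is the largest in the solar system"
leaked_doc = "quiz dump: which planet is the largest in the solar system answer jupiter"
clean_doc = "the solar system has eight planets"

leaked_overlap = overlap_fraction(question, leaked_doc)
clean_overlap = overlap_fraction(question, clean_doc)
print(f"leaked: {leaked_overlap:.2f}  clean: {clean_overlap:.2f}")
```

A high overlap fraction suggests the question may have been seen during training. Exact n-gram matching misses paraphrased or translated leaks, so a low value does not prove a benchmark is uncontaminated.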