Community · Tuesday, April 21, 2026 · 3 min read

QIMMA: New Leaderboard Puts Quality Before Quantity in Arabic LLM Evaluation


Why it matters

QIMMA is a new Arabic LLM leaderboard from the UAE’s Technology Innovation Institute (TII), featuring more than 52,000 samples across seven domains and a rigorous two-stage quality validation of benchmark items before any model evaluation takes place.

On April 21, 2026, the Technology Innovation Institute (TII) of the United Arab Emirates presented QIMMA (قِمّة, meaning “peak” or “summit” in Arabic), a new public leaderboard for large language models (LLMs) in Arabic. It is the first leaderboard to apply a quality-first approach: all benchmark items undergo rigorous validation before any models are evaluated on them. This methodological reversal addresses a long-standing problem in Arabic NLP (natural language processing), where models have been evaluated on benchmark sets riddled with errors.

Why Is a New Arabic Benchmark Needed Right Now?

Arabic is spoken by around 400 million people, yet it is systematically underrepresented in the LLM ecosystem relative to English. The problem is not just one of quantity: existing Arabic benchmarks exhibit serious systemic flaws. QIMMA’s team analyzed 14 source benchmarks and found high rates of rejected samples: ArabicMMLU had 436 problematic items (3.1%), MizanQA 2.3%, and other benchmarks showed similar percentages.

Errors included incorrect or mislabeled “correct” answers, text corrupted or made unreadable by encoding problems, spelling mistakes, and stereotypes or cultural misalignment. In practice, this meant that for years models were rewarded for guessing wrong answers and penalized for giving correct ones. QIMMA attempts to break that cycle at the source.

What Domains Are Covered and How Does Validation Work?

QIMMA covers seven domains with a total of 109 subsets and 52,000+ samples, 99% of which are in native Arabic. The domains are carefully chosen to encompass both universal and culturally specific areas: cultural topics (AraDiCE-Culture, ArabCulture, PalmX), STEM (ArabicMMLU, GAT), law (ArabLegalQA, MizanQA), medicine (MedArabiQ, MedAraBench), safety (AraTrust), poetry and literature (FannOrFlop) and programming (3LM HumanEval+ and MBPP+).
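
For readers who want that composition at a glance, here is a minimal sketch of the domain-to-benchmark mapping as a plain Python data structure. The domain and benchmark names are taken from the article; the dictionary layout itself is hypothetical and is not QIMMA’s actual configuration format.

```python
# Hypothetical sketch of QIMMA's benchmark composition as described above.
# Benchmark names come from the article; the structure is illustrative only.

QIMMA_DOMAINS = {
    "culture":    ["AraDiCE-Culture", "ArabCulture", "PalmX"],
    "stem":       ["ArabicMMLU", "GAT"],
    "law":        ["ArabLegalQA", "MizanQA"],
    "medicine":   ["MedArabiQ", "MedAraBench"],
    "safety":     ["AraTrust"],
    "literature": ["FannOrFlop"],   # poetry and literature
    "code":       ["3LM HumanEval+", "3LM MBPP+"],
}

assert len(QIMMA_DOMAINS) == 7  # seven domains; the 109 subsets sit below these
```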

Validation proceeds in two stages. In the first stage, two independent large models, Qwen3-235B and DeepSeek-V3-671B, score each sample against a 10-point rubric covering answer quality, formatting, cultural sensitivity and alignment with the “gold” answer. Items scoring below 7 are removed or sent on to the second stage. There, native Arabic speakers with cultural and dialectal expertise manually review the flagged cases, which is essential for domains such as poetry, where automated scoring has obvious limitations.
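
As an illustration, the following is a minimal sketch of what stage one of this pipeline could look like, assuming each judge returns a score on the 10-point rubric and any failing item is routed to the human-review queue. The judge names and the threshold of 7 come from the article; the data model, the all-judges-must-pass rule and the scoring stub are assumptions, since QIMMA’s actual prompts and aggregation are not described here.

```python
# Hypothetical sketch of QIMMA-style two-stage validation. Judge names and
# the 7/10 threshold come from the article; everything else is assumed.

from dataclasses import dataclass, field
from typing import Callable

JUDGES = ["Qwen3-235B", "DeepSeek-V3-671B"]  # the two independent judge models
PASS_THRESHOLD = 7.0                         # rubric scores below 7 fail stage one

@dataclass
class Sample:
    question: str
    gold_answer: str
    scores: dict[str, float] = field(default_factory=dict)

def stage_one(samples: list[Sample],
              score_fn: Callable[[str, Sample], float]):
    """Stage 1: score every sample with both judges. Assumption: an item
    passes only if *every* judge rates it >= 7; failing items go to the
    stage-2 queue for native-speaker review, where they may be fixed,
    kept, or removed."""
    accepted, review_queue = [], []
    for s in samples:
        s.scores = {j: score_fn(j, s) for j in JUDGES}
        if all(v >= PASS_THRESHOLD for v in s.scores.values()):
            accepted.append(s)
        else:
            review_queue.append(s)
    return accepted, review_queue

# Usage with a dummy scorer standing in for real LLM-judge calls:
dummy = lambda judge, sample: 8.5 if sample.gold_answer else 3.0
ok, flagged = stage_one([Sample("سؤال تجريبي", "أ")], dummy)
```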

Who Can Submit Models and What Do the Results Show?

The leaderboard is open: developers can submit their own models via the GitHub repository and the HuggingFace Spaces interface, and the entire framework is built on LightEval for reproducibility. The first release is topped by Qwen3.5-397B (68.06 average), while the Emirati Jais-2-70B-Chat (from InceptionAI) holds third place with 65.81. Interestingly, model size does not guarantee performance: the top 10 models range from 32B to 397B parameters, and mid-sized models frequently outperform larger ones.
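
The article reports a single averaged score per model but does not spell out the aggregation. A plausible reading is an unweighted macro-average across the seven domains, as in the sketch below; the per-domain numbers in the example are purely illustrative, not actual QIMMA results.

```python
# Hypothetical aggregation sketch: assumes the leaderboard average is an
# unweighted macro-average over the seven domains. Numbers are made up.

DOMAINS = ["culture", "stem", "law", "medicine", "safety", "literature", "code"]

def leaderboard_average(domain_scores: dict[str, float]) -> float:
    """Unweighted mean of per-domain scores on a 0-100 scale."""
    missing = set(DOMAINS) - domain_scores.keys()
    if missing:
        raise ValueError(f"missing domain scores: {sorted(missing)}")
    return sum(domain_scores[d] for d in DOMAINS) / len(DOMAINS)

# Illustrative scores only -- not real QIMMA results.
example = dict(zip(DOMAINS, [72.0, 65.5, 61.2, 70.3, 74.8, 58.9, 73.7]))
print(round(leaderboard_average(example), 2))  # 68.06, same scale as the article
```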

QIMMA fits into the broader context of the UAE’s AI strategy, which invests in native Arabic AI infrastructure (Jais, Falcon) as a geopolitical and cultural priority. For the global AI community this is an important step: it shows that multilingual benchmarks can, and must, put quality above mere quantity, and that a quality-first methodology can become the standard for other languages that have so far been neglected.


This article was generated using artificial intelligence from primary sources.