Community · Tuesday, April 21, 2026 · 3 min read

QIMMA: New Leaderboard Puts Quality Before Quantity in Arabic LLM Evaluation


Why it matters

QIMMA is a new Arabic LLM leaderboard from the UAE’s Technology Innovation Institute (TII), featuring more than 52,000 samples across seven domains and a rigorous two-stage quality validation of benchmark items before any model evaluation takes place.

On April 21, 2026, the Technology Innovation Institute (TII) of the United Arab Emirates presented QIMMA (قِمّة, meaning “peak” or “summit” in Arabic), a new public leaderboard for large language models (LLMs) in Arabic. It is the first leaderboard to apply a quality-first approach: all benchmark items undergo rigorous validation before any models are evaluated on them. This methodological reversal addresses a long-standing problem in Arabic NLP (natural language processing), where models have been evaluated on benchmark sets riddled with errors.

Why Is a New Arabic Benchmark Needed Right Now?

Arabic is spoken by around 400 million people, yet it is systematically underrepresented in the LLM ecosystem relative to English. The problem is not just one of quantity: existing Arabic benchmarks exhibit serious systemic flaws. QIMMA’s team analyzed 14 source benchmarks and found high rates of rejected samples: ArabicMMLU had 436 problematic items (3.1%), MizanQA 2.3%, and other benchmarks showed similar percentages.

Errors included incorrect or mislabeled “correct” answers, text corrupted or made unreadable by encoding problems, spelling mistakes, and stereotypes or cultural misalignment. In practice, this meant that for years models were rewarded for guessing wrong answers and penalized for giving correct ones. QIMMA attempts to break that cycle at the source.

What Domains Are Covered and How Does Validation Work?

QIMMA covers seven domains with a total of 109 subsets and 52,000+ samples, 99% of which are in native Arabic. The domains are carefully chosen to encompass both universal and culturally specific areas: cultural topics (AraDiCE-Culture, ArabCulture, PalmX), STEM (ArabicMMLU, GAT), law (ArabLegalQA, MizanQA), medicine (MedArabiQ, MedAraBench), safety (AraTrust), poetry and literature (FannOrFlop) and programming (3LM HumanEval+ and MBPP+).
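
For readers who want that composition at a glance, here is a minimal sketch of the domain-to-benchmark mapping as a plain Python data structure. The domain and benchmark names are taken from the article; the dictionary layout itself is hypothetical and is not QIMMA’s actual configuration format.

```python
# Hypothetical sketch of QIMMA's benchmark composition as described above.
# Benchmark names come from the article; the structure is illustrative only.

QIMMA_DOMAINS = {
    "culture":    ["AraDiCE-Culture", "ArabCulture", "PalmX"],
    "stem":       ["ArabicMMLU", "GAT"],
    "law":        ["ArabLegalQA", "MizanQA"],
    "medicine":   ["MedArabiQ", "MedAraBench"],
    "safety":     ["AraTrust"],
    "literature": ["FannOrFlop"],   # poetry and literature
    "code":       ["3LM HumanEval+", "3LM MBPP+"],
}

assert len(QIMMA_DOMAINS) == 7  # seven domains; the 109 subsets sit below these
```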

Validation proceeds in two stages. In the first stage, two independent large models, Qwen3-235B and DeepSeek-V3-671B, score each sample against a 10-point rubric covering answer quality, formatting, cultural sensitivity and alignment with the “gold” answer. Items scoring below 7 are removed or sent on to the second stage. There, native Arabic speakers with cultural and dialectal expertise manually review the flagged cases, which is essential for domains such as poetry, where automated scoring has obvious limitations.
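
As an illustration, the following is a minimal sketch of what stage one of this pipeline could look like, assuming each judge returns a score on the 10-point rubric and any failing item is routed to the human-review queue. The judge names and the threshold of 7 come from the article; the data model, the all-judges-must-pass rule and the scoring stub are assumptions, since QIMMA’s actual prompts and aggregation are not described here.

```python
# Hypothetical sketch of QIMMA-style two-stage validation. Judge names and
# the 7/10 threshold come from the article; everything else is assumed.

from dataclasses import dataclass, field
from typing import Callable

JUDGES = ["Qwen3-235B", "DeepSeek-V3-671B"]  # the two independent judge models
PASS_THRESHOLD = 7.0                         # rubric scores below 7 fail stage one

@dataclass
class Sample:
    question: str
    gold_answer: str
    scores: dict[str, float] = field(default_factory=dict)

def stage_one(samples: list[Sample],
              score_fn: Callable[[str, Sample], float]):
    """Stage 1: score every sample with both judges. Assumption: an item
    passes only if *every* judge rates it >= 7; failing items go to the
    stage-2 queue for native-speaker review, where they may be fixed,
    kept, or removed."""
    accepted, review_queue = [], []
    for s in samples:
        s.scores = {j: score_fn(j, s) for j in JUDGES}
        if all(v >= PASS_THRESHOLD for v in s.scores.values()):
            accepted.append(s)
        else:
            review_queue.append(s)
    return accepted, review_queue

# Usage with a dummy scorer standing in for real LLM-judge calls:
dummy = lambda judge, sample: 8.5 if sample.gold_answer else 3.0
ok, flagged = stage_one([Sample("سؤال تجريبي", "أ")], dummy)
```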

Who Can Submit Models and What Do the Results Show?

The leaderboard is open: developers can submit their own models via the GitHub repository and the HuggingFace Spaces interface, and the entire framework is built on LightEval for reproducibility. The first release is topped by Qwen3.5-397B (68.06 average), while the Emirati Jais-2-70B-Chat (from InceptionAI) holds third place with 65.81. Interestingly, model size does not guarantee performance: the top 10 models range from 32B to 397B parameters, and mid-sized models frequently outperform larger ones.
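
The article reports a single averaged score per model but does not spell out the aggregation. A plausible reading is an unweighted macro-average across the seven domains, as in the sketch below; the per-domain numbers in the example are purely illustrative, not actual QIMMA results.

```python
# Hypothetical aggregation sketch: assumes the leaderboard average is an
# unweighted macro-average over the seven domains. Numbers are made up.

DOMAINS = ["culture", "stem", "law", "medicine", "safety", "literature", "code"]

def leaderboard_average(domain_scores: dict[str, float]) -> float:
    """Unweighted mean of per-domain scores on a 0-100 scale."""
    missing = set(DOMAINS) - domain_scores.keys()
    if missing:
        raise ValueError(f"missing domain scores: {sorted(missing)}")
    return sum(domain_scores[d] for d in DOMAINS) / len(DOMAINS)

# Illustrative scores only -- not real QIMMA results.
example = dict(zip(DOMAINS, [72.0, 65.5, 61.2, 70.3, 74.8, 58.9, 73.7]))
print(round(leaderboard_average(example), 2))  # 68.06, same scale as the article
```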

QIMMA fits into the broader context of the UAE’s AI strategy, which invests in native Arabic AI infrastructure (Jais, Falcon) as a geopolitical and cultural priority. For the global AI community this is an important step: it shows that multilingual benchmarks can, and must, put quality above mere quantity, and that a quality-first methodology can become the standard for other languages that have so far been neglected.


This article was generated using artificial intelligence from primary sources.