arXiv: mmPISA-bench Tests Reasoning in 43 Languages

The compact multilingual reasoning benchmark mmPISA-bench is derived from OECD PISA testing and covers 43 languages, for a total of 2,150 data points. Modern LLMs reason effectively across all languages, and machine translations perform comparably to human ones. Certain languages simultaneously show higher costs and lower accuracy.

Researchers published a paper on arXiv on June 5, 2026 (identifier arXiv:2606.07069) that introduces mmPISA-bench, a compact multilingual benchmark for reasoning in language models. The benchmark is derived from the OECD's international PISA assessment and covers 43 languages, directly probing how uniformly modern models reason regardless of the language of the prompt.

What is mmPISA-bench and where does it come from?

The basis of the benchmark is OECD PISA (Programme for International Student Assessment), the well-known international assessment of students’ educational achievement. From it, the authors extracted 25 multiple-choice questions that require genuine reasoning, not mere recall of facts.

These 25 questions come in official human translations for 43 languages, and machine translations were added alongside them. The combination of all languages and both types of translation gives a total of 2,150 data points. The “compact” label is well earned here: the set is deliberately small but carefully constructed to measure the ability to reason.
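The total follows from simple multiplication. The minimal sketch below reproduces it, assuming every question exists in both translation types for every language; the article states the combination but not the exact split, and the function name is hypothetical.

```python
# Figures taken from the article; the uniform two-way split is an assumption.
N_QUESTIONS = 25
N_LANGUAGES = 43
TRANSLATION_TYPES = ("human", "machine")


def total_data_points(n_questions: int, n_languages: int, n_translation_types: int) -> int:
    """Count question/language/translation-type combinations."""
    return n_questions * n_languages * n_translation_types


total = total_data_points(N_QUESTIONS, N_LANGUAGES, len(TRANSLATION_TYPES))
print(total)  # 2150
```

Under that assumption, each translation type contributes 1,075 items, which matches the reported total of 2,150.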

Do models reason equally well across all languages?

The main finding of the paper is encouraging: modern LLMs reason effectively across all languages, with accuracy that matches human respondents. This means the ability to solve demanding, logically oriented questions is not reserved only for dominant languages like English, but transfers to lower-resource languages as well.

Still, the picture is not entirely uniform. The authors warn that some languages simultaneously show both higher inference costs and lower accuracy — in other words, for certain languages the model consumes more resources yet achieves a weaker result. This asymmetry remains an open area for further improvement.
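One way to make this asymmetry concrete is to flag languages that are both less accurate and more expensive than a reference language. The sketch below is only an illustration: the `LanguageResult` fields, the use of mean tokens per answer as a cost proxy, and all numbers are assumptions made for the example, not the paper's method or results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageResult:
    language: str
    accuracy: float     # fraction of questions answered correctly
    mean_tokens: float  # hypothetical cost proxy: mean tokens per answer


def lagging_languages(results, reference):
    """Return languages that are both less accurate and costlier than the reference."""
    ref = next(r for r in results if r.language == reference)
    return [
        r.language
        for r in results
        if r.language != reference
        and r.accuracy < ref.accuracy
        and r.mean_tokens > ref.mean_tokens
    ]


# Placeholder values invented for illustration; they are not results from the paper.
results = [
    LanguageResult("ref", 0.92, 400.0),
    LanguageResult("lang_a", 0.92, 410.0),  # costlier, equally accurate
    LanguageResult("lang_b", 0.84, 560.0),  # costlier and less accurate
    LanguageResult("lang_c", 0.88, 390.0),  # less accurate but cheaper
]

flagged = lagging_languages(results, "ref")
print(flagged)  # ['lang_b']
```

Only languages that fail on both axes at once are flagged, which mirrors the "higher cost and lower accuracy" pattern described above.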

Are machine translations good enough in quality?

A particularly practical finding concerns machine translations. In the study they perform comparably to human translations, suggesting that the quality of synthetic (machine-translated) data is sufficient for large-scale evaluation.
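A comparison of this kind can be sketched as a paired check over the same questions in both translation types. The helper names and the per-question correctness lists below are hypothetical placeholders, not data or code from the study.

```python
def accuracy(correct):
    """Fraction of True entries in a list of per-question outcomes."""
    return sum(correct) / len(correct)


def compare_translations(human_correct, machine_correct):
    """Summarize paired outcomes for the same questions in two translation types."""
    if len(human_correct) != len(machine_correct):
        raise ValueError("both lists must cover the same questions")
    # Questions where the model's outcome differs between the two translations.
    discordant = sum(h != m for h, m in zip(human_correct, machine_correct))
    return {
        "human_accuracy": accuracy(human_correct),
        "machine_accuracy": accuracy(machine_correct),
        "gap": accuracy(human_correct) - accuracy(machine_correct),
        "discordant_items": discordant,
    }


# Placeholder outcomes invented for illustration; not results from the paper.
human = [True, True, False, True, True]
machine = [True, False, True, True, True]

summary = compare_translations(human, machine)
print(summary)
```

A small accuracy gap can still hide per-question disagreements, so the count of discordant items is reported alongside the aggregate numbers.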

This matters for the community because building multilingual benchmarks usually depends on expensive and slow human translations. If machine translations yield comparable results, that opens the way to faster and cheaper construction of benchmarks that cover many languages.

Why is this benchmark relevant?

mmPISA-bench fills a gap in evaluation because it focuses on reasoning, not just translation or text comprehension, and does so across a large number of languages at once. It thus gives a clearer picture of whether models’ advanced abilities are truly globally available or concentrated in a handful of languages.

The paper’s conclusions — that models reason effectively everywhere, but with remaining differences in cost and accuracy — also give development teams concrete guidance. Optimizing inference costs for languages that currently lag could be the next step toward truly equitable multilingual reasoning.

It is also worth highlighting the paper’s methodological message. By showing that a compact set of just 25 carefully selected questions, spread across 43 languages, can yield meaningful insights, mmPISA-bench suggests that a quality benchmark need not be large to be useful. Reliance on the recognized OECD PISA source further strengthens the credibility of the questions, since they were already designed to measure genuine reasoning in humans.

Frequently Asked Questions

What is mmPISA-bench?

mmPISA-bench is a compact multilingual reasoning benchmark derived from OECD PISA testing. It consists of 25 multiple-choice questions that require reasoning, translated into 43 languages. Alongside official human translations it also includes machine translations, giving a total of 2,150 data points.

Do models reason equally well across all languages?

According to the results, modern language models reason effectively across all languages, with accuracy that matches human respondents. Still, some languages simultaneously show both higher inference costs and lower accuracy, so differences between languages have not entirely disappeared.

Are machine translations good enough for this kind of evaluation?

Yes. In the mmPISA-bench study, machine translations perform comparably to human ones, showing that the quality of synthetic (machine-translated) data is sufficient for large-scale evaluation. This makes building multilingual benchmarks easier because it does not depend solely on expensive human translations.
