Multi-LCB: LiveCodeBench in 12 Languages, 24 Models

Multi-LCB is an extension of the LiveCodeBench benchmark from Python to 12 programming languages, described in the paper arXiv:2606.20517 and accepted at ICLR 2026. By testing 24 large language models, the authors reveal significant Python overfitting and language-specific data contamination, directly exposing the limits of multilingual code generation in today's models.

LiveCodeBench measures how well large language models write correct code from problem statements. Because it has focused on Python, it left open the question of how well models actually handle other languages. Multi-LCB, presented in the paper arXiv:2606.20517, addresses that gap by extending the benchmark to 12 programming languages. The paper was accepted at ICLR 2026, one of the leading machine learning conferences.

What Was Discovered

By testing 24 large language models, the authors identified two problems. The first is significant Python overfitting: models are considerably more successful in Python than in other languages, which suggests they have adapted to the distribution of their training data rather than acquired a general understanding of programming. The second is language-specific data contamination, where certain test tasks were likely already seen during training.
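
The Python overfitting gap can be made concrete with a small calculation. The sketch below is only an illustration, not the paper's method: the function name `python_gap` is hypothetical, the scores are made-up placeholders rather than results from the paper, and it assumes each language is summarized by a single pass rate between 0 and 1.

```python
from statistics import mean


def python_gap(scores: dict[str, float]) -> float:
    """Return the Python pass rate minus the mean pass rate of all other languages.

    A large positive value suggests the model is much stronger in Python.
    """
    if "python" not in scores:
        raise ValueError("scores must include a 'python' entry")
    others = [rate for lang, rate in scores.items() if lang != "python"]
    if not others:
        raise ValueError("scores must include at least one non-Python language")
    return scores["python"] - mean(others)


# Hypothetical placeholder numbers, NOT results from the paper.
example_scores = {"python": 0.60, "rust": 0.40, "go": 0.50, "kotlin": 0.30}
gap = python_gap(example_scores)
print(f"Python gap: {gap:.2f}")
```

A gap near zero would indicate balanced performance across languages; how large a gap counts as "significant" is a judgment the paper itself would have to supply.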

Why Differences Between Languages Matter

Data contamination means benchmark results are inflated because the model “memorizes” solutions rather than deriving them. A single-language test cannot show whether a high score reflects general programming ability or familiarity with that one language; Multi-LCB exposes the difference by comparing performance across 12 languages. The practical implication: a model’s score measured on Python alone overestimates its actual ability to generate code in languages like Rust, Go, or Kotlin.
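
One common way to look for contamination, and the idea behind LiveCodeBench's dated tasks, is to compare pass rates on tasks released before and after a model's training cutoff. The sketch below illustrates that comparison for one model in one language. Whether Multi-LCB detects contamination in exactly this way is an assumption; the function name `pass_rates_around_cutoff`, the cutoff date, and the records are all hypothetical.

```python
from datetime import date


def pass_rates_around_cutoff(
    results: list[tuple[date, bool]], cutoff: date
) -> tuple[float, float]:
    """Split (release_date, passed) records at the cutoff.

    Returns (pass rate before the cutoff, pass rate on or after the cutoff).
    """
    before = [passed for released, passed in results if released < cutoff]
    after = [passed for released, passed in results if released >= cutoff]
    if not before or not after:
        raise ValueError("need at least one task on each side of the cutoff")
    return sum(before) / len(before), sum(after) / len(after)


# Hypothetical records for one model in one language, NOT data from the paper.
records = [
    (date(2024, 1, 10), True),
    (date(2024, 2, 5), True),
    (date(2024, 6, 1), True),
    (date(2024, 7, 15), False),
]
old_rate, new_rate = pass_rates_around_cutoff(records, cutoff=date(2024, 4, 1))
print(f"before cutoff: {old_rate:.2f}, after cutoff: {new_rate:.2f}")
```

A sharp drop after the cutoff in one language but not in others would be consistent with language-specific contamination, though a drop alone does not prove it.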

Implication for Tool Development

The finding is relevant for development teams that rely on AI assistants outside the Python ecosystem. Multi-LCB provides a fairer, multilingual measure and is an open resource for future evaluations, giving model makers a clearer signal on where training on underrepresented languages needs improvement.

Frequently Asked Questions

What is Multi-LCB?

Multi-LCB is an extension of the LiveCodeBench benchmark from Python to 12 programming languages, designed to measure multilingual code generation in large language models.

How many models were tested?

The authors tested 24 large language models and discovered significant Python overfitting and language-specific data contamination.

Where was the paper published?

The paper arXiv:2606.20517 was accepted at ICLR 2026.
