arXiv:2606.20517: Multi-LCB Extends LiveCodeBench to 12 Programming Languages and Reveals Python Overfitting in 24 Models
Multi-LCB is an extension of the LiveCodeBench benchmark from Python to 12 programming languages, described in the paper arXiv:2606.20517 and accepted at ICLR 2026. By testing 24 large language models, the authors reveal significant Python overfitting and language-specific data contamination, directly exposing the limits of multilingual code generation in today's models.
This article was generated using artificial intelligence from primary sources.
The new paper arXiv:2606.20517 presents Multi-LCB, an extension of the popular LiveCodeBench benchmark from Python to 12 programming languages. LiveCodeBench measures how well large language models write correct code from task descriptions. Because it previously focused on Python, it left open the question of how well models actually handle other languages. The paper was accepted at ICLR 2026, one of the leading machine learning conferences.
What Was Discovered
By testing 24 large language models, the authors identified two problems. The first is significant Python overfitting: models are considerably more successful in Python than in other languages, suggesting they have adapted to the distribution of their training data rather than acquired a general understanding of programming. The second is language-specific data contamination, where certain test tasks were likely already seen during training.
Why Differences Between Languages Matter
Data contamination inflates benchmark results because the model “memorizes” solutions rather than deriving them. A single-language test hides the gap between Python and other languages, whereas Multi-LCB exposes it by comparing performance across 12 languages. The practical implication: a score measured on Python alone overestimates a model’s actual ability to generate code in languages like Rust, Go, or Kotlin.
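As a minimal sketch of how such a cross-language comparison could be quantified, the snippet below computes the difference between a model's Python pass rate and its average pass rate in other languages. The function name `python_gap` and all pass rates are hypothetical and invented for illustration; they are not taken from the paper, and the paper's own metrics may differ.

```python
from statistics import mean

# Hypothetical pass rates (fraction of tasks solved) for a single model.
# These values are made up for illustration and are NOT results from the paper.
pass_rates = {
    "python": 0.60,
    "rust": 0.45,
    "go": 0.50,
    "kotlin": 0.40,
}


def python_gap(rates, reference="python"):
    """Return the reference-language pass rate minus the mean of all other languages."""
    others = [rate for lang, rate in rates.items() if lang != reference]
    if not others:
        raise ValueError("need at least one non-reference language")
    return rates[reference] - mean(others)


gap = python_gap(pass_rates)
print(f"Python gap: {gap:.2f}")
```

A positive gap means the model scores higher in Python than in the other languages on average. A Python-only benchmark cannot reveal a gap like this.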
Implication for Tool Development
The finding is relevant for development teams that rely on AI assistants outside the Python ecosystem. Multi-LCB provides a fairer, multilingual measure and is an open resource for future evaluations, giving model makers a clearer signal on where training on underrepresented languages needs improvement.
Frequently Asked Questions
- What is Multi-LCB?
- Multi-LCB is an extension of the LiveCodeBench benchmark from Python to 12 programming languages, designed to measure multilingual code generation in large language models.
- How many models were tested?
- The authors tested 24 large language models and discovered significant Python overfitting and language-specific data contamination.
- Where was the paper published?
- The paper arXiv:2606.20517 was accepted at ICLR 2026.