
GitHub: Open Multilingual Repositories Dataset with 80 Million Rows and 40 Million Repositories


GitHub has published the Multilingual Repositories Dataset, which holds more than 80 million classification rows covering over 40 million repositories and is released under the open CC0-1.0 license. For each repository the dataset records three text sources (the README, the most-commented issue, and the most-commented pull request) alongside language detection from three tools: fastText, gcld3, and lingua-py. Portuguese leads among non-English README files, while Korean is the most represented non-English language in issue discussions.


This article was generated using artificial intelligence from primary sources.

GitHub has released the Multilingual Repositories Dataset, an open dataset aimed at researchers and development teams building multilingual AI systems.

What does the dataset contain?

The dataset spans more than 80 million classification rows across 40+ million repositories and is published under the CC0-1.0 license, which places the content in the public domain with no usage restrictions. For each repository the dataset records three text sources: the README file, the most-commented issue, and the most-commented pull request. It also includes metadata such as creation date, star count, fork count, primary programming language, and SPDX license tag.
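To make the layout concrete, here is a minimal Python sketch of what one classification row might look like. The class name, field names, and the one-row-per-source layout are assumptions made for illustration; the published schema may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, Optional

# The three text sources described in the article.
Source = Literal["readme", "issue", "pull_request"]


@dataclass(frozen=True)
class ClassificationRow:
    """Hypothetical shape of one row; not the dataset's actual schema."""
    repo: str                        # assumed identifier, e.g. "owner/name"
    source: Source                   # which text was classified
    created_at: str                  # repository creation date (ISO 8601)
    stars: int
    forks: int
    primary_language: Optional[str]  # primary programming language
    spdx_license: Optional[str]      # SPDX license tag, e.g. "MIT"


# Invented example rows: one repository contributing two rows.
rows = [
    ClassificationRow("example/app", "readme", "2021-03-14", 120, 8, "Python", "MIT"),
    ClassificationRow("example/app", "issue", "2021-03-14", 120, 8, "Python", "MIT"),
]

repos = {row.repo for row in rows}
print(len(rows), len(repos))  # 2 1
```

If rows are indeed keyed by repository and text source, that would be consistent with the row count exceeding the repository count.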

How does GitHub detect the language of a repository?

Language detection is performed with three independent tools (fastText, gcld3, and lingua-py), each of which reports its own confidence score, with a threshold of 0.5 applied to those scores. Running three detectors rather than one lets researchers cross-check the results and filter examples by how many detectors agree, which limits the impact of any single tool's misclassifications.
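Filtering by detector agreement could look roughly like the following Python sketch. The `Detection` class, the detector keys, and the sample scores are hypothetical, and treating the 0.5 threshold as a strict lower bound is an assumption.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Detection:
    """One detector's guess for a text; field names are hypothetical."""
    language: str      # language code, e.g. "pt"
    confidence: float  # the detector's own score in [0, 1]


def agreement(
    detections: dict[str, Detection], threshold: float = 0.5
) -> tuple[Optional[str], int]:
    """Return (majority language, number of detectors voting for it).

    Only detections with confidence strictly above `threshold` get a vote.
    """
    votes = Counter(
        d.language for d in detections.values() if d.confidence > threshold
    )
    if not votes:
        return None, 0
    language, count = votes.most_common(1)[0]
    return language, count


# Invented scores for a single text.
row = {
    "fasttext": Detection("pt", 0.97),
    "gcld3": Detection("pt", 0.88),
    "lingua": Detection("es", 0.61),
}
print(agreement(row))  # ('pt', 2)
```

A stricter filter would keep only rows where the returned count is 3, trading coverage for higher label confidence.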

Which languages stand out in the data?

According to GitHub, Portuguese leads among non-English README files with more than 3 million repositories, while Korean is the most represented non-English language in issue discussions. The difference suggests that the language mix depends on whether you look at documentation or at community conversation.
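A comparison like this amounts to a per-source tally of detected languages. The Python sketch below shows the idea on a handful of invented (source, language) pairs; the values are toy data, not figures from the dataset, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

# Toy (source, language) pairs, invented purely for illustration.
samples = [
    ("readme", "pt"), ("readme", "pt"), ("readme", "en"), ("readme", "ko"),
    ("issue", "ko"), ("issue", "ko"), ("issue", "en"), ("issue", "pt"),
]


def top_non_english(pairs):
    """Return the most frequent non-English language for each text source."""
    counts = defaultdict(Counter)
    for source, language in pairs:
        if language != "en":
            counts[source][language] += 1
    return {source: c.most_common(1)[0][0] for source, c in counts.items()}


print(top_non_english(samples))  # {'readme': 'pt', 'issue': 'ko'}
```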

What is the dataset used for?

GitHub lists several use cases: building multilingual evaluation sets for AI tools, improving the representation of European languages in the open-source ecosystem, and supporting natural language processing research. The open CC0 license removes legal barriers to using the data for model training and evaluation.

Frequently Asked Questions

What does the GitHub Multilingual Repositories Dataset contain?
More than 80 million classification rows across 40+ million repositories, with the README, the most-commented issue, and the most-commented pull request recorded for each repository.
Under which license is the dataset?
Under the CC0-1.0 license, placing it in the public domain and free for any use.
How is language detected?
With three independent tools (fastText, gcld3, and lingua-py), each reporting its own confidence score against a threshold of 0.5.