NVIDIA Nemotron OCR v2: 34.7 pages per second, five languages in one model, 28x faster than PaddleOCR
Why it matters
NVIDIA published Nemotron OCR v2 on HuggingFace — a multilingual OCR model processing 34.7 pages per second on a single A100 GPU. That is 28 times faster than PaddleOCR v5. The model supports English, Chinese, Japanese, Korean and Russian in a single architecture, with no language detection required. Trained on 12.2 million synthetic images, the model and dataset are available under the NVIDIA Open Model License and CC-BY-4.0.
NVIDIA published Nemotron OCR v2 on HuggingFace on April 17, 2026 — the second generation of its optical character recognition system. Authors Bo Liu, Ryan Chesler, Yuri Babakhin and pCriisS have achieved performance that redefines the industry standard — 34.7 pages per second on a single A100 GPU for the multilingual model.
Speed and benchmarks
On the OmniDocBench benchmark, Nemotron OCR v2 (multilingual) vs. competition:
| Model | Pages/s |
|---|---|
| EasyOCR | 0.4 |
| PaddleOCR v5 | 1.2 |
| OpenOCR | 1.5 |
| Nemotron OCR v2 (multi) | 34.7 |
| Nemotron OCR v2 (EN) | 40.7 |
That is 28 times faster than PaddleOCR v5 and 87 times faster than EasyOCR. For an enterprise processing millions of documents per day, the difference between 1 and 35 pages/s translates into dramatic savings in GPU hours.
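The savings claim is easy to quantify with back-of-envelope arithmetic using the benchmark's pages/s figures. The daily volume and GPU price below are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope GPU-hour cost of OCR throughput at the OmniDocBench
# pages/s numbers quoted above. Volume and price are assumed for illustration.
PAGES_PER_DAY = 1_000_000
USD_PER_GPU_HOUR = 2.0   # assumed A100 on-demand rate, not from the article

def gpu_hours(pages: int, pages_per_sec: float) -> float:
    """GPU hours needed to OCR `pages` at a given throughput."""
    return pages / pages_per_sec / 3600

for name, pps in [("EasyOCR", 0.4),
                  ("PaddleOCR v5", 1.2),
                  ("Nemotron OCR v2 (multi)", 34.7)]:
    h = gpu_hours(PAGES_PER_DAY, pps)
    print(f"{name:24s} {h:7.1f} GPU-h/day  ~${h * USD_PER_GPU_HOUR:,.0f}/day")
```

At these assumed rates, a million pages a day drops from roughly 231 GPU-hours (PaddleOCR v5) to about 8 (Nemotron OCR v2 multi).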
Multilingual in one model
The key innovation of v2 is its language-agnostic architecture. A single model covers:
- English
- Chinese (Simplified and Traditional)
- Japanese
- Korean
- Russian
No language detection required. Classic OCR stacks use separate models for each language and must first detect which language appears in the image, adding latency and potentially failing on mixed-language documents. Nemotron OCR v2 avoids this entirely with a single character set of 14,244 characters (v1 had only 855).
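The effect of a unified character set can be sketched in a few lines. The tiny charsets below are toy stand-ins for the real 14,244-character vocabulary, and the decoding scheme is illustrative, not Nemotron's actual tokenization:

```python
# Sketch: why one merged character set removes the language-detection step.
# Per-language OCR stacks route each image through a language-specific charset,
# so a detector must pick one first. A single merged vocabulary lets one
# decoder emit any supported script, including mixed-script lines.

latin = list("abcdefghijklmnopqrstuvwxyz ")
cyrillic = list("привет")
cjk = list("日本語한국어中文")

# Merge into one vocabulary: index -> character, deduplicated, order-stable.
unified = list(dict.fromkeys(latin + cyrillic + cjk))
char_to_id = {c: i for i, c in enumerate(unified)}

def decode(ids):
    """Map predicted class ids straight to text, no language routing."""
    return "".join(unified[i] for i in ids)

mixed = "hello 日本語"                    # a mixed-script line
ids = [char_to_id[c] for c in mixed]     # stand-in for model predictions
assert decode(ids) == mixed              # one decoder handles both scripts
```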
Synthetic training — 12.2 million images
The biggest technical innovation is not the architecture itself but the approach to data. NVIDIA built a synthetic pipeline that generated 12.2 million images across six language subsets (Chinese counted separately as Simplified and Traditional), typically 1.5 to 2.3 million per subset, with a train/test/val split.
Synthetic pipeline
Text source: mOSCAR (multilingual web corpus, 163 language subsets)
Rendering engine: Modified SynthDoG with extensions:
- Multi-level bounding boxes (word, line, paragraph with 4-point quads)
- Hierarchical reading order graphs (inspired by the HierText project)
- Varied layout modes: multi-column text, scattered text, vertical columns, tables, slides, documents
- 165 to 1,258 open-source fonts per language (Google Fonts, Noto family)
- Line-level recognition for CJK languages (without word segmentation)
Augmentations:
- Text-level: borders, shadows, extrusion, edge noise, stroke opacity
- Image-level: morphological operators, median blur, elastic distortion
- Page-level: contrast/brightness jitter, Gaussian/motion blur, shadows
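Two of the page-level augmentations named above can be sketched with plain NumPy. The real pipeline is not published as code, so the jitter ranges and kernel size here are assumptions, and the blur is a simple box approximation:

```python
# Minimal sketch of two page-level augmentations: contrast/brightness jitter
# and a box-approximated blur. Parameter ranges are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def brightness_contrast_jitter(img, max_shift=0.2, max_gain=0.3):
    """img: float array in [0, 1]. Random affine intensity transform."""
    gain = 1.0 + rng.uniform(-max_gain, max_gain)   # contrast factor
    shift = rng.uniform(-max_shift, max_shift)      # brightness offset
    return np.clip(gain * img + shift, 0.0, 1.0)

def box_blur(img, k=3):
    """Separable k x k mean filter with edge padding."""
    pad = k // 2
    out = np.pad(img, pad, mode="edge")
    kernel = np.ones(k) / k
    # Blur rows, then columns (separable convolution).
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, "valid"), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, "valid"), 0, out)
    return out

page = rng.random((8, 8))                 # stand-in for a rendered page
aug = box_blur(brightness_contrast_jitter(page))
assert aug.shape == page.shape            # augmentations preserve page size
```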
FOTS architecture
Three components, one backbone:
- Text Detector (RegNetX-8GF)
- Text Recognizer (6-layer pre-norm Transformer for multilingual)
- Relational Model (compact Transformer encoder)
The key to efficiency is the shared convolutional backbone — input is processed once, and feature reuse across all three components eliminates redundant computation. That is where the 28x speedup over cascade pipelines comes from — pipelines where each stage re-processes the input.
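The shared-backbone idea can be illustrated with a toy pipeline that simply counts backbone forward passes. The class and stage names below are illustrative, not Nemotron's actual modules:

```python
# Toy illustration of the shared-backbone design: detector, recognizer, and
# relational head all read one set of backbone features, so the expensive
# convolutional pass runs once per page instead of once per stage.

class Backbone:
    def __init__(self):
        self.calls = 0
    def features(self, page):
        self.calls += 1            # stands in for the heavy conv forward pass
        return f"feat({page})"

def shared_pipeline(page, backbone):
    f = backbone.features(page)    # computed once...
    detections = ("boxes", f)      # ...reused by all three heads
    words = ("text", f)
    order = ("reading-order", f)
    return detections, words, order

def cascade_pipeline(page, backbone):
    # Each stage re-encodes the input, as in classic multi-model OCR stacks.
    det = ("boxes", backbone.features(page))
    rec = ("text", backbone.features(page))
    rel = ("reading-order", backbone.features(page))
    return det, rec, rel

b1, b2 = Backbone(), Backbone()
shared_pipeline("page-1", b1)
cascade_pipeline("page-1", b2)
print(b1.calls, b2.calls)   # shared: 1 backbone pass; cascade: 3
```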
Quality is as impressive as speed
Normalized Edit Distance (NED) on the SynthDoG multilingual benchmark — lower is better:
| Language | PaddleOCR | OpenOCR | Nemotron v1 | Nemotron v2 |
|---|---|---|---|---|
| English | 0.117 | 0.105 | 0.078 | 0.069 |
| Japanese | 0.201 | 0.586 | 0.723 | 0.046 |
| Korean | 0.133 | 0.837 | 0.923 | 0.047 |
| Russian | 0.163 | 0.950 | 0.564 | 0.043 |
| Chinese S. | 0.054 | 0.061 | 0.784 | 0.035 |
| Chinese T. | 0.094 | 0.127 | 0.700 | 0.065 |
The v1 → v2 leap is dramatic. Japanese improves from 0.723 to 0.046, Korean from 0.923 to 0.047, Chinese Traditional from 0.700 to 0.065: error reductions of more than an order of magnitude.
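For reference, Normalized Edit Distance is Levenshtein distance normalized by string length. One common normalization divides by the longer string's length, sketched below; the benchmark's exact variant may differ slightly:

```python
# Normalized Edit Distance (NED), the metric in the table above:
# Levenshtein distance divided by the longer string's length.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, truth: str) -> float:
    """0.0 is a perfect match; 1.0 means nothing matched."""
    if not pred and not truth:
        return 0.0
    return edit_distance(pred, truth) / max(len(pred), len(truth))

print(round(ned("kitten", "sitting"), 3))  # 0.429
```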
Licensing and availability
- Model: `nvidia/nemotron-ocr-v2` on HuggingFace
- Dataset: `nvidia/OCR-Synthetic-Multilingual-v1` (12.2M images)
- Demo: Space on HuggingFace for live testing
- Model license: NVIDIA Open Model License (commercial use permitted)
- Dataset license: CC-BY-4.0
The open dataset is particularly valuable: research groups can now train their own OCR models on data produced with the same methodology.
Why this matters
Nemotron OCR v2 represents a moment where synthetic data is demonstrated as fully adequate for tasks that traditionally required expensive manual labelling. The synthetic pipeline is cheaper, more scalable and — most importantly — covers languages for which there is insufficient real training data.
For enterprises wanting OCR as a component of their AI stack, especially for multilingual document workflows, Nemotron OCR v2 sets a new baseline — not just for quality, but for economics.
This article was generated using artificial intelligence from primary sources.