Models · Saturday, April 18, 2026 · 4 min read

NVIDIA Nemotron OCR v2: 34.7 pages per second, five languages in one model, 28x faster than PaddleOCR

Why it matters

NVIDIA published Nemotron OCR v2 on HuggingFace — a multilingual OCR model processing 34.7 pages per second on a single A100 GPU. That is 28 times faster than PaddleOCR v5. The model supports English, Chinese, Japanese, Korean and Russian in a single architecture, with no language detection required. Trained on 12.2 million synthetic images, the model and dataset are available under the NVIDIA Open Model License and CC-BY-4.0.

NVIDIA published Nemotron OCR v2 on HuggingFace on April 17, 2026 — the second generation of its optical character recognition system. Authors Bo Liu, Ryan Chesler, Yuri Babakhin and pCriisS have achieved performance that redefines the industry standard — 34.7 pages per second on a single A100 GPU for the multilingual model.

Speed and benchmarks

On the OmniDocBench benchmark, Nemotron OCR v2 (multilingual) vs. competition:

Model                     Pages/s
PaddleOCR v5              1.2
OpenOCR                   1.5
Nemotron OCR v2 (multi)   34.7
Nemotron OCR v2 (EN)      40.7
EasyOCR                   0.4

That is 28 times faster than PaddleOCR v5 and 87 times faster than EasyOCR. For an enterprise processing millions of documents per day, the gap between 1.2 and 34.7 pages/s translates directly into large savings in GPU hours.
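The economics can be made concrete with back-of-the-envelope arithmetic. The throughput figures below come from the benchmark table; the daily volume and the GPU-hour price are illustrative assumptions, not figures from NVIDIA:

```python
# Back-of-the-envelope GPU-hour cost for a daily OCR workload.
# Throughput figures are from the benchmark table above; the
# 10M-pages/day volume and $2/GPU-hour price are assumptions.
PAGES_PER_DAY = 10_000_000
GPU_HOUR_USD = 2.00

def daily_gpu_hours(pages_per_second: float) -> float:
    """GPU-hours needed to process PAGES_PER_DAY at the given rate."""
    return PAGES_PER_DAY / pages_per_second / 3600

for name, pps in [("PaddleOCR v5", 1.2), ("Nemotron OCR v2 (multi)", 34.7)]:
    hours = daily_gpu_hours(pps)
    print(f"{name}: {hours:,.0f} GPU-hours/day (~${hours * GPU_HOUR_USD:,.0f})")
```

At these assumed numbers the slower pipeline needs roughly 2,300 GPU-hours per day versus about 80, which is where the cost argument comes from.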

Multilingual in one model

The key innovation of v2 is its language-agnostic architecture. A single model covers:

  • English
  • Chinese (Simplified and Traditional)
  • Japanese
  • Korean
  • Russian

No language detection required. Classic OCR stacks use separate models for each language and must first detect which language appears in the image — adding latency and potentially failing on mixed-language documents. Nemotron OCR v2 sidesteps this with a single character set of 14,244 characters (v1 had only 855).
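The idea behind a single character set is simple: merge every language's character inventory into one vocabulary, so the recognizer's output classes cover all scripts at once and it never has to decide which alphabet it is looking at. A minimal sketch with tiny toy inventories (the real model's 14,244-character set is far larger):

```python
# Merging per-language character inventories into one shared charset,
# the way a language-agnostic recognizer indexes its output classes.
# The inventories here are tiny toy samples, not the real 14,244 set.
per_language = {
    "en": set("abcdefghijklmnopqrstuvwxyz"),
    "ru": set("абвгдежзиклмн"),
    "ja": set("あいうえおカキクケコ"),
}

# Union, then a stable sorted index: one output class per character.
charset = sorted(set().union(*per_language.values()))
char_to_id = {ch: i for i, ch in enumerate(charset)}

# Mixed-language text maps straight to class ids, no language detection.
ids = [char_to_id[c] for c in "abвгあい" if c in char_to_id]
print(len(charset), ids[:3])
```

A mixed English/Russian/Japanese string is handled by the same index, which is exactly why mixed-language documents stop being a failure mode.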

Synthetic training — 12.2 million images

The biggest technical innovation is not the architecture itself but the approach to data. NVIDIA built a synthetic pipeline generating 12.2 million images in total across six language subsets, with a typical 1.5 to 2.3 million images per language, already divided into train/test/val splits.

Synthetic pipeline

Text source: mOSCAR (multilingual web corpus, 163 language subsets)

Rendering engine: Modified SynthDoG with extensions:

  • Multi-level bounding boxes (word, line, paragraph with 4-point quads)
  • Hierarchical reading order graphs (inspired by the HierText project)
  • Varied layout modes: multi-column text, scattered text, vertical columns, tables, slides, documents
  • 165 to 1,258 open-source fonts per language (Google Fonts, Noto family)
  • Line-level recognition for CJK languages (without word segmentation)

Augmentations:

  • Text-level: borders, shadows, extrusion, edge noise, stroke opacity
  • Image-level: morphological operators, median blur, elastic distortion
  • Page-level: contrast/brightness jitter, Gaussian/motion blur, shadows
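The pipeline stages above can be pictured as a per-page configuration draw: each synthetic sample picks a layout mode, a font, and a subset of augmentations at random before rendering. A stdlib-only sketch of that sampling step (the names, ranges, and probabilities here are illustrative, not NVIDIA's actual generator):

```python
import random

# Illustrative sampler for one synthetic page's rendering config,
# mirroring the pipeline stages described above. All choices and
# ranges are made up; the real SynthDoG-based generator differs.
LAYOUTS = ["multi-column", "scattered", "vertical", "table", "slide", "document"]
TEXT_AUGS = ["border", "shadow", "extrusion", "edge-noise", "stroke-opacity"]
PAGE_AUGS = ["contrast-jitter", "gaussian-blur", "motion-blur", "shadow"]

def sample_page_config(rng: random.Random, fonts: list[str]) -> dict:
    """Draw one synthetic page's rendering configuration."""
    return {
        "layout": rng.choice(LAYOUTS),
        "font": rng.choice(fonts),
        "text_augs": rng.sample(TEXT_AUGS, k=rng.randint(0, 2)),
        "page_augs": rng.sample(PAGE_AUGS, k=rng.randint(0, 2)),
    }

rng = random.Random(0)
cfg = sample_page_config(rng, fonts=["NotoSansJP", "NotoSerif"])
print(cfg)
```

Because every page is generated from such a config, ground-truth bounding boxes and reading order come for free — no manual annotation is involved at any point.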

FOTS architecture

Three components, one backbone:

  1. Text Detector (RegNetX-8GF)
  2. Text Recognizer (6-layer pre-norm Transformer for multilingual)
  3. Relational Model (compact Transformer encoder)

The key to efficiency is the shared convolutional backbone — input is processed once, and feature reuse across all three components eliminates redundant computation. That is where the 28x speedup over cascade pipelines comes from — pipelines where each stage re-processes the input.
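The saving is easy to see with a toy cost model: in a cascade, each of the three stages pays for its own feature extraction, while a shared backbone pays once. The unit costs below are assumptions chosen purely to illustrate the principle (the real 28x figure also reflects batching and implementation details, not this arithmetic alone):

```python
# Toy cost model: cascade vs. shared-backbone pipelines.
# The relative costs are assumptions for illustration only,
# not measured numbers from Nemotron OCR v2.
BACKBONE = 10.0   # cost of one feature-extraction pass
HEADS = {"detector": 1.0, "recognizer": 2.0, "relational": 0.5}

# Cascade: every stage re-extracts features from the input.
cascade = sum(BACKBONE + h for h in HEADS.values())

# Shared backbone: features are computed once and reused by all heads.
shared = BACKBONE + sum(HEADS.values())

print(cascade, shared, cascade / shared)
```

The heavier the backbone relative to the task-specific heads, the closer the cascade's cost gets to N times the shared design's, which is why sharing pays off most when the backbone dominates.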

Quality is as impressive as speed

Normalized Edit Distance (NED) on the SynthDoG multilingual benchmark — lower is better:

Language     PaddleOCR   OpenOCR   Nemotron v1   Nemotron v2
English      0.117       0.105     0.078         0.069
Japanese     0.201       0.586     0.723         0.046
Korean       0.133       0.837     0.923         0.047
Russian      0.163       0.950     0.564         0.043
Chinese S.   0.054       0.061     0.784         0.035
Chinese T.   0.094       0.127     0.700         0.065

The v1 → v2 leap is dramatic: Japanese from 0.723 to 0.046, Korean from 0.923 to 0.047, Chinese Traditional from 0.700 to 0.065. That is roughly an order-of-magnitude reduction in error.
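NED itself is straightforward to compute: the Levenshtein edit distance between prediction and ground truth, divided by a string length. Normalizing by the ground-truth length is a common convention, assumed here; the benchmark's exact choice may differ. A stdlib-only reference implementation:

```python
def ned(pred: str, truth: str) -> float:
    """Normalized edit distance: Levenshtein(pred, truth) / len(truth).
    Normalizing by ground-truth length is one common convention."""
    m, n = len(pred), len(truth)
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == truth[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution / match
        prev = curr
    return prev[n] / max(n, 1)

print(ned("kitten", "sitting"))  # 3 edits over 7 characters
```

On this scale a score of 0.046 means fewer than five character edits per hundred characters of ground truth.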

Licensing and availability

  • Model: nvidia/nemotron-ocr-v2 on HuggingFace
  • Dataset: nvidia/OCR-Synthetic-Multilingual-v1 (12.2M images)
  • Demo: Space on HuggingFace for live testing
  • Model license: NVIDIA Open Model License (commercial use permitted)
  • Dataset license: CC-BY-4.0

The open dataset is particularly valuable — research groups now have access to data produced by the pipeline and can train their own OCR models using the same methodology.

Why this matters

Nemotron OCR v2 marks a moment where synthetic data is shown to be fully adequate for tasks that traditionally required expensive manual labelling. The synthetic pipeline is cheaper, more scalable and — most importantly — covers languages for which there is too little real training data.

For enterprises wanting OCR as a component of their AI stack, especially for multilingual document workflows, Nemotron OCR v2 sets a new baseline — not just for quality, but for economics.


This article was generated using artificial intelligence from primary sources.