NVIDIA Nemotron OCR v2: 34.7 pages per second, five languages in one model, 28x faster than PaddleOCR
Why it matters
NVIDIA published Nemotron OCR v2 on HuggingFace — a multilingual OCR model processing 34.7 pages per second on a single A100 GPU. That is 28 times faster than PaddleOCR v5. The model supports English, Chinese, Japanese, Korean and Russian in a single architecture, with no language detection required. Trained on 12.2 million synthetic images, the model and dataset are available under the NVIDIA Open Model License and CC-BY-4.0.
NVIDIA published Nemotron OCR v2 on HuggingFace on April 17, 2026 — the second generation of its optical character recognition system. Authors Bo Liu, Ryan Chesler, Yuri Babakhin and pCriisS have achieved performance that redefines the industry standard — 34.7 pages per second on a single A100 GPU for the multilingual model.
Speed and benchmarks
On the OmniDocBench benchmark, Nemotron OCR v2 (multilingual) vs. competition:
| Model | Pages/s |
|---|---|
| EasyOCR | 0.4 |
| PaddleOCR v5 | 1.2 |
| OpenOCR | 1.5 |
| Nemotron OCR v2 (multi) | 34.7 |
| Nemotron OCR v2 (EN) | 40.7 |
That is 28 times faster than PaddleOCR v5 and 87 times faster than EasyOCR. For an enterprise processing millions of documents per day, the difference between 1 and 35 pages/s translates into dramatic savings in GPU hours.
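The savings claim is easy to quantify with back-of-envelope arithmetic using the benchmark's pages/s figures. The daily volume and GPU price below are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope GPU-hour cost of OCR throughput at the OmniDocBench
# pages/s numbers quoted above. Volume and price are assumed for illustration.
PAGES_PER_DAY = 1_000_000
USD_PER_GPU_HOUR = 2.0   # assumed A100 on-demand rate, not from the article

def gpu_hours(pages: int, pages_per_sec: float) -> float:
    """GPU hours needed to OCR `pages` at a given throughput."""
    return pages / pages_per_sec / 3600

for name, pps in [("EasyOCR", 0.4),
                  ("PaddleOCR v5", 1.2),
                  ("Nemotron OCR v2 (multi)", 34.7)]:
    h = gpu_hours(PAGES_PER_DAY, pps)
    print(f"{name:24s} {h:7.1f} GPU-h/day  ~${h * USD_PER_GPU_HOUR:,.0f}/day")
```

At these assumed rates, a million pages a day drops from roughly 231 GPU-hours (PaddleOCR v5) to about 8 (Nemotron OCR v2 multi).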
Multilingual in one model
The key innovation of v2 is its language-agnostic architecture. A single model covers:
- English
- Chinese (Simplified and Traditional)
- Japanese
- Korean
- Russian
No language detection required. Classic OCR stacks use separate models for each language and must first detect which language appears in the image, adding latency and potentially failing on mixed-language documents. Nemotron OCR v2 avoids this entirely with a single character set of 14,244 characters (v1 had only 855).
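The effect of a unified character set can be sketched in a few lines. The tiny charsets below are toy stand-ins for the real 14,244-character vocabulary, and the decoding scheme is illustrative, not Nemotron's actual tokenization:

```python
# Sketch: why one merged character set removes the language-detection step.
# Per-language OCR stacks route each image through a language-specific charset,
# so a detector must pick one first. A single merged vocabulary lets one
# decoder emit any supported script, including mixed-script lines.

latin = list("abcdefghijklmnopqrstuvwxyz ")
cyrillic = list("привет")
cjk = list("日本語한국어中文")

# Merge into one vocabulary: index -> character, deduplicated, order-stable.
unified = list(dict.fromkeys(latin + cyrillic + cjk))
char_to_id = {c: i for i, c in enumerate(unified)}

def decode(ids):
    """Map predicted class ids straight to text, no language routing."""
    return "".join(unified[i] for i in ids)

mixed = "hello 日本語"                    # a mixed-script line
ids = [char_to_id[c] for c in mixed]     # stand-in for model predictions
assert decode(ids) == mixed              # one decoder handles both scripts
```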
Synthetic training — 12.2 million images
The biggest technical innovation is not the architecture itself but the approach to data. NVIDIA built a synthetic pipeline that generated 12.2 million images across six language subsets (Chinese counted separately as Simplified and Traditional), typically 1.5 to 2.3 million per subset, with a train/test/val split.
Synthetic pipeline
Text source: mOSCAR (multilingual web corpus, 163 language subsets)
Rendering engine: Modified SynthDoG with extensions:
- Multi-level bounding boxes (word, line, paragraph with 4-point quads)
- Hierarchical reading order graphs (inspired by the HierText project)
- Varied layout modes: multi-column text, scattered text, vertical columns, tables, slides, documents
- 165 to 1,258 open-source fonts per language (Google Fonts, Noto family)
- Line-level recognition for CJK languages (without word segmentation)
Augmentations:
- Text-level: borders, shadows, extrusion, edge noise, stroke opacity
- Image-level: morphological operators, median blur, elastic distortion
- Page-level: contrast/brightness jitter, Gaussian/motion blur, shadows
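Two of the page-level augmentations named above can be sketched with plain NumPy. The real pipeline is not published as code, so the jitter ranges and kernel size here are assumptions, and the blur is a simple box approximation:

```python
# Minimal sketch of two page-level augmentations: contrast/brightness jitter
# and a box-approximated blur. Parameter ranges are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def brightness_contrast_jitter(img, max_shift=0.2, max_gain=0.3):
    """img: float array in [0, 1]. Random affine intensity transform."""
    gain = 1.0 + rng.uniform(-max_gain, max_gain)   # contrast factor
    shift = rng.uniform(-max_shift, max_shift)      # brightness offset
    return np.clip(gain * img + shift, 0.0, 1.0)

def box_blur(img, k=3):
    """Separable k x k mean filter with edge padding."""
    pad = k // 2
    out = np.pad(img, pad, mode="edge")
    kernel = np.ones(k) / k
    # Blur rows, then columns (separable convolution).
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, "valid"), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, "valid"), 0, out)
    return out

page = rng.random((8, 8))                 # stand-in for a rendered page
aug = box_blur(brightness_contrast_jitter(page))
assert aug.shape == page.shape            # augmentations preserve page size
```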
FOTS architecture
Three components, one backbone:
- Text Detector (RegNetX-8GF)
- Text Recognizer (6-layer pre-norm Transformer for multilingual)
- Relational Model (compact Transformer encoder)
The key to efficiency is the shared convolutional backbone — input is processed once, and feature reuse across all three components eliminates redundant computation. That is where the 28x speedup over cascade pipelines comes from — pipelines where each stage re-processes the input.
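The shared-backbone idea can be illustrated with a toy pipeline that simply counts backbone forward passes. The class and stage names below are illustrative, not Nemotron's actual modules:

```python
# Toy illustration of the shared-backbone design: detector, recognizer, and
# relational head all read one set of backbone features, so the expensive
# convolutional pass runs once per page instead of once per stage.

class Backbone:
    def __init__(self):
        self.calls = 0
    def features(self, page):
        self.calls += 1            # stands in for the heavy conv forward pass
        return f"feat({page})"

def shared_pipeline(page, backbone):
    f = backbone.features(page)    # computed once...
    detections = ("boxes", f)      # ...reused by all three heads
    words = ("text", f)
    order = ("reading-order", f)
    return detections, words, order

def cascade_pipeline(page, backbone):
    # Each stage re-encodes the input, as in classic multi-model OCR stacks.
    det = ("boxes", backbone.features(page))
    rec = ("text", backbone.features(page))
    rel = ("reading-order", backbone.features(page))
    return det, rec, rel

b1, b2 = Backbone(), Backbone()
shared_pipeline("page-1", b1)
cascade_pipeline("page-1", b2)
print(b1.calls, b2.calls)   # shared: 1 backbone pass; cascade: 3
```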
Quality is as impressive as speed
Normalized Edit Distance (NED) on the SynthDoG multilingual benchmark — lower is better:
| Language | PaddleOCR | OpenOCR | Nemotron v1 | Nemotron v2 |
|---|---|---|---|---|
| English | 0.117 | 0.105 | 0.078 | 0.069 |
| Japanese | 0.201 | 0.586 | 0.723 | 0.046 |
| Korean | 0.133 | 0.837 | 0.923 | 0.047 |
| Russian | 0.163 | 0.950 | 0.564 | 0.043 |
| Chinese S. | 0.054 | 0.061 | 0.784 | 0.035 |
| Chinese T. | 0.094 | 0.127 | 0.700 | 0.065 |
The v1 → v2 leap is dramatic. Japanese improves from 0.723 to 0.046, Korean from 0.923 to 0.047, Chinese Traditional from 0.700 to 0.065: error reductions of more than an order of magnitude.
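For reference, Normalized Edit Distance is Levenshtein distance normalized by string length. One common normalization divides by the longer string's length, sketched below; the benchmark's exact variant may differ slightly:

```python
# Normalized Edit Distance (NED), the metric in the table above:
# Levenshtein distance divided by the longer string's length.

def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, truth: str) -> float:
    """0.0 is a perfect match; 1.0 means nothing matched."""
    if not pred and not truth:
        return 0.0
    return edit_distance(pred, truth) / max(len(pred), len(truth))

print(round(ned("kitten", "sitting"), 3))  # 0.429
```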
Licensing and availability
- Model: `nvidia/nemotron-ocr-v2` on HuggingFace
- Dataset: `nvidia/OCR-Synthetic-Multilingual-v1` (12.2M images)
- Demo: Space on HuggingFace for live testing
- Model license: NVIDIA Open Model License (commercial use permitted)
- Dataset license: CC-BY-4.0
The open dataset is particularly valuable: research groups can now train their own OCR models on data produced with the same methodology.
Why this matters
Nemotron OCR v2 represents a moment where synthetic data is demonstrated as fully adequate for tasks that traditionally required expensive manual labelling. The synthetic pipeline is cheaper, more scalable and — most importantly — covers languages for which there is insufficient real training data.
For enterprises wanting OCR as a component of their AI stack, especially for multilingual document workflows, Nemotron OCR v2 sets a new baseline — not just for quality, but for economics.
This article was generated using artificial intelligence from primary sources.