🟢 📦 Open Source · Thursday, April 30, 2026 · 3 min read

IBM Granite 4.1: open-source family of 3B/8B/30B Apache 2.0 models trained on 15T tokens shows that a dense 8B model matches 32B MoE

Editorial illustration: granite blocks with an open book as a symbol of the open-weights license

On April 29, 2026, IBM published on the HuggingFace blog the details of building the Granite 4.1 model family: 3B, 8B, and 30B dense variants under the Apache 2.0 license. The models were trained on ~15T tokens through a 5-phase pre-training pipeline, followed by SFT and a 4-phase RL stage using GRPO with a DAPO loss. Granite 4.1-8B Instruct matches or surpasses the previous Granite 4.0-H-Small (a 32B-A9B MoE) on most benchmarks, showing that dense models can reach MoE quality at the same activation budget.

On April 29, 2026, IBM published a technical deep-dive on the HuggingFace blog about building Granite 4.1, an open-source family of LLMs under the Apache 2.0 license. The post is considerably more detailed than a typical marketing launch and includes concrete figures on the pre-training pipeline, RL phases, and benchmark results.

Sizes and architecture

Three dense variants (NOT MoE):

| Size | Layers | Embed dim | KV heads |
|------|--------|-----------|----------|
| 3B   | 40     | 2,560     | 8 (GQA)  |
| 8B   | 40     | 4,096     | 8 (GQA)  |
| 30B  | 64     | 4,096     | 8 (GQA)  |

All variants use GQA (Grouped Query Attention), RoPE, SwiGLU activations, and RMSNorm. Context scales to 512K tokens through staged Long-Context Extension (LCE) with a training mix of 80% books + 20% code in the final phase.
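
To make the table concrete, here is a minimal sketch of the three configs and the KV-cache footprint that GQA implies at long context. The head dimension, bf16 cache dtype, and model names in the code are assumptions for illustration; only the layer counts, embedding dims, and KV-head counts come from the post.

```python
# Minimal sketch of the published architecture configs and the KV-cache
# footprint implied by GQA. head_dim and the bf16 cache dtype are assumptions
# (not stated in the post), so the printed numbers are illustrative only.
from dataclasses import dataclass

@dataclass
class GraniteConfig:
    name: str
    layers: int
    embed_dim: int
    kv_heads: int          # GQA: all query heads share these few KV heads
    head_dim: int = 128    # assumption; the blog post does not state it

VARIANTS = [
    GraniteConfig("granite-4.1-3b", 40, 2560, 8),    # illustrative names, not official ids
    GraniteConfig("granite-4.1-8b", 40, 4096, 8),
    GraniteConfig("granite-4.1-30b", 64, 4096, 8),
]

def kv_cache_bytes(cfg: GraniteConfig, context_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache K and V for one sequence (bf16 assumed)."""
    return 2 * cfg.layers * cfg.kv_heads * cfg.head_dim * context_len * bytes_per_value

for cfg in VARIANTS:
    gib = kv_cache_bytes(cfg, context_len=128 * 1024) / 2**30
    print(f"{cfg.name}: ~{gib:.1f} GiB KV cache at 128K context")
```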

5-phase pre-training (~15T tokens)

A sophisticated strategy:

| Phase | Tokens   | Focus                                                      |
|-------|----------|------------------------------------------------------------|
| 1     | 10T      | General (59% CommonCrawl, 20% Code, 7% Math)               |
| 2     | 2T       | Math/Code emphasis (35% Math, 30% Code)                    |
| 3     | 2T       | High-quality annealing + 12.5% CoT                         |
| 4     | 0.5T     | Refinement (40% CommonCrawl-HQ, 9% language instructions)  |
| 5     | variable | Long-context extension 32K→128K→512K                       |
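
As a rough illustration, such a staged data mixture can be expressed as a declarative schedule like the sketch below. The field names, the "other" padding bucket, and the weighted-sampling helper are illustrative, not IBM's actual pipeline config; only the token budgets and stated percentages come from the post.

```python
# Sketch of the 5-phase data schedule as a declarative config.
# Percentages not spelled out in the post are padded with an "other" bucket.
import random

PHASES = [
    {"phase": 1, "tokens": "10T",     "mix": {"commoncrawl": 0.59, "code": 0.20, "math": 0.07, "other": 0.14}},
    {"phase": 2, "tokens": "2T",      "mix": {"math": 0.35, "code": 0.30, "other": 0.35}},
    {"phase": 3, "tokens": "2T",      "mix": {"high_quality_anneal": 0.875, "cot": 0.125}},
    {"phase": 4, "tokens": "0.5T",    "mix": {"commoncrawl_hq": 0.40, "lang_instructions": 0.09, "other": 0.51}},
    {"phase": 5, "tokens": "variable","mix": {"books": 0.80, "code": 0.20}},  # LCE: 32K -> 128K -> 512K
]

def sample_source(mix: dict[str, float]) -> str:
    """Pick the data source for the next document according to the phase mixture."""
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

print(sample_source(PHASES[0]["mix"]))  # e.g. "commoncrawl"
```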

SFT + 4-phase RL pipeline

After pre-training:

  • SFT on ~4.1M curated samples, 3 epochs, 5e-6 learning rate, 16K sequence length
  • RL pipeline uses on-policy GRPO with DAPO loss (Yu et al., 2025); a simplified loss sketch follows this list:
    1. Multi-domain RL (45,504 prompts)
    2. RLHF (17,920 prompts) → ~18.9-point gain on AlpacaEval
    3. Identity & Knowledge-Calibration RL (1,728 prompts)
    4. Math RL (13,504 prompts) → +3.8 GSM8K, +23.48 DeepMind-Math
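
For readers unfamiliar with the combination, the sketch below shows the general shape of a GRPO update with DAPO's token-level, asymmetrically clipped ("clip-higher") objective. It is a simplified illustration of the technique, not IBM's training code; the clipping values, shapes, and the absence of a KL term are assumptions taken from the DAPO paper rather than from the Granite post.

```python
# Simplified GRPO advantage + DAPO-style token-level clipped objective.
# Illustrative only; hyperparameters and tensor shapes are assumptions.
import torch

def grpo_dapo_loss(logp_new, logp_old, rewards, mask, eps_low=0.2, eps_high=0.28):
    """
    logp_new, logp_old: [G, T] per-token log-probs for G sampled responses to one prompt
    rewards:            [G]    scalar reward per response
    mask:               [G, T] 1 for real response tokens, 0 for padding
    """
    # GRPO: group-relative advantage, no learned value network
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)        # [G]
    adv = adv.unsqueeze(-1)                                          # broadcast over tokens

    ratio = torch.exp(logp_new - logp_old)                           # per-token importance ratio
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * adv    # DAPO: wider upper clip

    # DAPO: average over all tokens in the group (token-level, not per-sequence)
    per_token = -torch.minimum(unclipped, clipped)
    return (per_token * mask).sum() / mask.sum()
```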

Key result: 8B dense ≈ 32B MoE

The most interesting finding: Granite 4.1-8B Instruct matches or surpasses the previous Granite 4.0-H-Small (32B-A9B MoE) on:

  • IFEval, AlpacaEval, MMLU-Pro, BBH (general)
  • GSM8K, DeepMind-Math (math)
  • HumanEval+, MBPP+ (code), ArenaHard (chat), BFCL V3 (function calling)

Concrete 8B Instruct numbers: MMLU 73.84, GSM8K 92.49, HumanEval 87.20, AlpacaEval 2.0 50.08, IFEval Avg 87.06, BFCL v3 68.27.

This suggests the MoE advantage has narrowed when models are compared at the same activation budget: a dense 8B (8B active parameters) can compete with a 32B-A9B MoE (9B active). That runs counter to the trend toward sparse MoE architectures popularized by Mixtral and DeepSeek-V3.
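
As a back-of-the-envelope illustration of what "32B-A9B" means versus a dense 8B: the expert split in the sketch below is purely hypothetical (the post does not describe the MoE internals) and is chosen only so the totals land near 32B total / 9B active.

```python
# Hypothetical expert split, used only to illustrate total vs. active parameters.
def moe_params(shared_b: float, n_experts: int, expert_b: float, top_k: int):
    total = shared_b + n_experts * expert_b     # every expert's weights exist on disk/GPU
    active = shared_b + top_k * expert_b        # but each token only uses top_k of them
    return total, active

total, active = moe_params(shared_b=1.0, n_experts=62, expert_b=0.5, top_k=16)
print(f"MoE:      {total:.0f}B total, {active:.0f}B active per token")   # ~32B total, ~9B active
print("Dense 8B:  8B total,  8B active per token")
```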

Long-context performance

On the RULER benchmark:

  • 8B-base: 83.6 (32K) → 79.1 (64K) → 73.0 (128K)
  • 30B-base: 85.2 (32K) → 84.6 (64K) → 76.7 (128K)

512K is available but RULER was not evaluated at that length.

Infrastructure and deployment

Training on NVIDIA GB200 NVL72 clusters (72-GPU NVLink domains, NDR 400 Gb/s InfiniBand). FP8 quantization available for inference (~50% reduction in disk/GPU memory). 12 languages supported: English, German, Spanish, French, Japanese, Portuguese, Arabic, Czech, Italian, Korean, Dutch, Chinese.
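
For anyone who wants to try the weights, a minimal inference sketch with Hugging Face transformers follows. The repo id is an assumption based on the collection name below; check the Hub for the actual model ids and for FP8-quantized variants.

```python
# Minimal chat inference sketch. The repo id is assumed, not confirmed by the post.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.1-8b-instruct"  # assumed id; verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Summarize the Granite 4.1 training pipeline."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```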

Resources:

  • HuggingFace: ibm-granite/granite-41-language-models
  • GitHub: ibm-granite/granite-4.1-language-models

Frequently Asked Questions

What sizes and architecture does Granite 4.1 offer?
Three dense variants: 3B (40 layers, 2,560 embed), 8B (40 layers, 4,096 embed), 30B (64 layers, 4,096 embed). All use Grouped Query Attention (8 KV heads), RoPE, SwiGLU, RMSNorm. Context scales to 512K tokens through staged Long-Context Extension (LCE).
What is the 5-phase pre-training strategy?
Phase 1 (10T tokens): general (59% CommonCrawl, 20% Code, 7% Math). Phase 2 (2T): math/code emphasis. Phase 3 (2T): high-quality annealing with 12.5% CoT data. Phase 4 (0.5T): refinement. Phase 5 (variable): long-context extension (32K→128K→512K) on 80% Books + 20% Code.
What does it mean that the 8B matches the 32B MoE?
Granite 4.1-8B Instruct matches or surpasses the previous Granite 4.0-H-Small (32B-A9B MoE) on IFEval, AlpacaEval, MMLU-Pro, BBH, GSM8K, DeepMind-Math, HumanEval+, ArenaHard, BFCL V3, and MBPP+. This suggests that the MoE advantage has narrowed at comparable active parameters.
🤖

This article was generated using artificial intelligence from primary sources.