DeepSeek releases V4-Pro and V4-Flash: two open-source models with a one-million-token context and 80.6 on SWE-bench Verified
Why it matters
On April 24, 2026, DeepSeek released V4-Pro (1.6T total / 49B active parameters) and V4-Flash (284B / 13B active), two open-source models with a one-million-token context window. V4-Pro scored 80.6 on SWE-bench Verified, close to Opus 4.6, while drastically reducing memory consumption.
DeepSeek on Thursday, April 24, 2026, published a preview release of the V4 series: two open-weight models, V4-Pro with 1.6 trillion total and 49 billion active parameters, and V4-Flash with 284 billion total and 13 billion active parameters. Both support a one-million-token context window by default across all official services.
The release comes at a moment when competition among frontier models is shifting from pure benchmark numbers to efficiency in long context and agentic workflows. DeepSeek published the weights on Hugging Face Hub along with an accompanying technical report.
What does the new V4 architecture bring?
The key innovation is a hybrid attention mechanism combining two complementary techniques. The first is CSA (Compressed Sparse Attention), which compresses every four tokens into a single KV record using a learned positional approach, with an FP4 “Lightning Indexer” then selecting the top-k most relevant compressed blocks per query.
The second is HCA (Heavily Compressed Attention) with a 128× compression ratio, using dense MQA (Multi-Query Attention) over heavily compressed blocks without the need for sparse selection. Both techniques retain a sliding full-attention window over the most recent tokens.
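The block-compression-plus-selection idea behind CSA can be illustrated with a toy sketch. The code below is purely illustrative with made-up dimensions: it mean-pools every four key vectors into one record (a stand-in for the learned compressor) and scores records with a plain dot product (a stand-in for the FP4 Lightning Indexer); neither matches DeepSeek's actual implementation.

```python
import random

def compress_kv(keys: list[list[float]], block: int = 4) -> list[list[float]]:
    """Mean-pool each run of `block` consecutive key vectors into one record.

    A stand-in for a learned compressor: T vectors become T // block records.
    """
    d = len(keys[0])
    out = []
    for i in range(0, len(keys) - block + 1, block):
        chunk = keys[i:i + block]
        out.append([sum(v[j] for v in chunk) / block for j in range(d)])
    return out

def select_top_k_blocks(query: list[float],
                        compressed: list[list[float]], k: int) -> list[int]:
    """Score each compressed record against the query, return top-k indices."""
    scores = [sum(q * c for q, c in zip(query, rec)) for rec in compressed]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

random.seed(0)
keys = [[random.gauss(0, 1) for _ in range(8)] for _ in range(64)]  # 64 tokens, dim 8
query = [random.gauss(0, 1) for _ in range(8)]

comp = compress_kv(keys)                    # 16 compressed records instead of 64
top = select_top_k_blocks(query, comp, k=4)  # attend only to these blocks
print(len(comp), len(top))
```

Full attention then runs only over the selected blocks plus the recent-token window, which is where the KV and FLOP savings come from.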
The result is a KV cache just 2% the size of a standard GQA-8 baseline. At a 1M-token context, V4-Pro uses 27% of the FLOPs V3.2 requires, and V4-Flash only 10%.
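A back-of-the-envelope calculation shows the scale of that saving. The layer and head counts below are placeholders chosen for illustration, not V4's published configuration; only the 2% ratio comes from the report.

```python
def kv_cache_gib(tokens: int, layers: int, kv_heads: int,
                 head_dim: int, bytes_per_elem: int) -> float:
    """Total KV cache in GiB: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

# Placeholder GQA-8 config: 60 layers, 8 KV heads, head dim 128, FP16 entries.
baseline = kv_cache_gib(1_000_000, 60, 8, 128, 2)
compressed = baseline * 0.02  # the reported 2% figure
print(f"{baseline:.0f} GiB -> {compressed:.1f} GiB at a 1M-token context")
```

Under these assumed dimensions, a dense GQA-8 cache at 1M tokens would run to hundreds of GiB, while the compressed cache fits comfortably on a single accelerator.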
How capable are they on real tasks?
On SWE-bench Verified, which measures a model's ability to autonomously resolve real GitHub issues, V4-Pro-Max achieves 80.6%, practically matching Anthropic's Opus 4.6-Max (80.8%). On Toolathlon, which measures tool orchestration, V4-Pro leads with 51.8 versus 50.0 for Kimi K2.6.
On an internal benchmark of 30 tasks spanning PyTorch, CUDA, Rust, and C++, V4-Pro-Max solves 67% of tasks, slightly behind Opus 4.5 (70%) and well ahead of Sonnet 4.5 (47%). In an internal survey of 91 DeepSeek engineers, 52% said they are ready to switch to it as their primary coding model, with a further 39% leaning toward "yes".
How does agentic post-training work?
Alongside the architectural changes, DeepSeek introduced interleaved thinking: a chain of reasoning that persists across user message boundaries in multi-step tool-call flows. Without tools, the model behaves conventionally, clearing its reasoning at each new user message.
For tool calls, the release introduces an XML format marked by a dedicated |DSML| token. Example:
|DSML|
<tool_call>
<function_name>search</function_name>
<parameters>
<param name="query" string="true">weather in Zagreb</param>
</parameters>
</tool_call>
The format reduces escaping errors with nested quotes and cleanly separates string from structured parameters, a typical pain point with JSON schemas.
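Because the payload after the |DSML| token is plain XML, it can be handled with a standard XML parser. The sketch below parses the fragment shown above; the helper name `parse_tool_call` is invented for illustration, not part of any DeepSeek API.

```python
import xml.etree.ElementTree as ET

def parse_tool_call(payload: str) -> tuple[str, dict]:
    """Parse the XML portion of a DSML-style tool call into (function, params)."""
    root = ET.fromstring(payload)
    name = root.findtext("function_name")
    params = {}
    for p in root.find("parameters"):
        # String params arrive as raw text, so quotes inside values
        # need no escaping; structured params could be decoded further.
        params[p.get("name")] = p.text
    return name, params

payload = """<tool_call>
<function_name>search</function_name>
<parameters>
<param name="query" string="true">weather in Zagreb</param>
</parameters>
</tool_call>"""

name, params = parse_tool_call(payload)
print(name, params)
```

A value like `say "hello"` would pass through verbatim as element text, whereas in a JSON schema the inner quotes would need escaping.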
What is the DSec sandbox?
For agentic reinforcement learning, DeepSeek built DSec (DeepSeek Elastic Compute), a Rust-based infrastructure supporting four execution layers: function calls, containers, microVMs (Firecracker), and full VMs (QEMU). The system scales to hundreds of thousands of parallel sandboxes and enables “preemption-safe replay” — resuming training without re-executing tool calls.
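The four layers form a ladder trading startup cost against isolation. A hypothetical dispatcher (all names below are invented for illustration, not DSec's actual interface) might pick a layer like this:

```python
from enum import Enum

class SandboxLayer(Enum):
    FUNCTION = 1   # in-process function call: cheapest, least isolated
    CONTAINER = 2  # OS-level isolation
    MICROVM = 3    # Firecracker: hardware virtualization, fast boot
    FULL_VM = 4    # QEMU: full device emulation, strongest isolation

def pick_layer(untrusted_code: bool, needs_kernel: bool,
               needs_devices: bool) -> SandboxLayer:
    """Toy policy: escalate isolation with risk and capability requirements."""
    if needs_devices:
        return SandboxLayer.FULL_VM
    if needs_kernel:
        return SandboxLayer.MICROVM
    if untrusted_code:
        return SandboxLayer.CONTAINER
    return SandboxLayer.FUNCTION

print(pick_layer(untrusted_code=True, needs_kernel=False, needs_devices=False).name)
```

Running most tool calls at the cheap end of such a ladder is one plausible way a system scales to hundreds of thousands of parallel sandboxes.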
This infrastructure is why V4 can be trained on real tool environments rather than synthetic traces, which is noticeable in its strength on the Toolathlon and MCPAtlas benchmarks.
When is the migration deadline?
DeepSeek simultaneously announced that the old deepseek-chat and deepseek-reasoner endpoints will be fully shut down on July 24, 2026 at 15:59 UTC. Development teams using the DeepSeek API have three months to migrate.
The new versions are available in three reasoning modes (non-think, think-high, think-max), and models are released in FP4 quantization for MoE experts and FP8 for the rest, further reducing memory requirements.
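A rough estimate shows what the mixed FP4/FP8 release means for weight storage. The 95% expert fraction below is an assumption for illustration; the article does not give the exact expert/non-expert split.

```python
def weight_gib(params: float, bits: float) -> float:
    """Weight storage in GiB for `params` parameters at `bits` bits each."""
    return params * bits / 8 / 2**30

TOTAL = 1.6e12          # V4-Pro total parameters
EXPERT_FRACTION = 0.95  # assumption: most MoE parameters live in the experts

experts = weight_gib(TOTAL * EXPERT_FRACTION, 4)      # FP4 experts
rest = weight_gib(TOTAL * (1 - EXPERT_FRACTION), 8)   # FP8 for the remainder
bf16 = weight_gib(TOTAL, 16)                          # uncompressed reference

print(f"mixed: {experts + rest:.0f} GiB vs BF16: {bf16:.0f} GiB")
```

Under these assumptions the quantized checkpoint is roughly a quarter of a BF16 one, which is the "further reduced memory requirements" in concrete terms.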
For development teams looking to self-host the models, V4-Flash is the more practical option: 13 billion active parameters enable inference on far more standard GPU hardware than V3.2 required.
This article was generated using artificial intelligence from primary sources.