AMD: Instinct MI355X outperforms NVIDIA B200 on ComfyUI workflows with PyTorch optimizations in ROCm 7.2.0
AMD Instinct MI355X is a data center GPU that outperforms NVIDIA B200 in published benchmarks across three ComfyUI generative workflows — text-to-video Wan2.2 (1.44×), text-to-image FLUX.1-dev (1.42×), and 3D Hunyuan3D v2.1 (1.20×) — thanks to AOTriton gfx950 kernels, hipBLASLt GEMM tuning, and other ROCm 7.2.0 optimizations.
This article was generated using artificial intelligence from primary sources.
AMD has released benchmarks showing the Instinct MI355X outperforming NVIDIA's B200 across three ComfyUI generative workflows. The company attributes the lead to PyTorch attention optimizations for the CDNA4 architecture in ROCm 7.2.0. The optimized configuration is distributed as a Docker image so that users can rerun the benchmarks themselves.
Results by workflow
AMD reports the following runtimes (lower is better; MI355X first, B200 second):
- Text-to-Video (Wan2.2): 1.439× speedup, 116.91 s vs. 168.28 s.
- Text-to-Image (FLUX.1-dev): 1.416× speedup, 24.77 s vs. 35.09 s.
- 3D Generation (Hunyuan3D v2.1): 1.201× speedup, 21.51 s vs. 25.84 s.
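The speedup figures can be checked against the quoted runtimes. The sketch below assumes each speedup is defined as B200 runtime divided by MI355X runtime (the published material quoted here does not spell out the formula); the dictionary layout and function name are illustrative only.

```python
# Runtimes in seconds as quoted above: (MI355X, B200, reported speedup).
RUNTIMES = {
    "Wan2.2 (text-to-video)": (116.91, 168.28, 1.439),
    "FLUX.1-dev (text-to-image)": (24.77, 35.09, 1.416),
    "Hunyuan3D v2.1 (3D)": (21.51, 25.84, 1.201),
}


def speedup(mi355x_s: float, b200_s: float) -> float:
    """Assumed definition: baseline (B200) runtime over MI355X runtime."""
    return b200_s / mi355x_s


for name, (mi355x_s, b200_s, reported) in RUNTIMES.items():
    ratio = speedup(mi355x_s, b200_s)
    saved = 1.0 - mi355x_s / b200_s  # fraction of B200 wall-clock time saved
    print(f"{name}: {ratio:.3f}x computed, {reported:.3f}x reported, "
          f"{saved:.1%} less time")
```

Under this definition the recomputed ratios agree with the reported ones to within about 0.001. Differences of that size are expected if the published speedups were derived from unrounded timings.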
CDNA4 hardware
The MI355X has 256 compute units, 288 GB of HBM3e memory, and 8 TB/s of memory bandwidth. These specifications favor the attention-heavy operations typical of diffusion models: the large memory can hold intermediate representations of high-resolution images and video frames without tiling, and the high bandwidth shortens the time compute units spend waiting on memory.
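To give a feel for what 8 TB/s means, the sketch below computes an idealized lower bound on the time needed to move a given amount of data through memory once. It assumes decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes) and peak bandwidth with no compute or latency overhead, so real kernels will be slower. The constant and function names are illustrative, not part of any AMD tool.

```python
HBM_CAPACITY_BYTES = 288e9      # 288 GB, decimal units assumed
HBM_BANDWIDTH_BYTES_S = 8e12    # 8 TB/s peak, decimal units assumed


def min_stream_time_s(bytes_moved: float,
                      bandwidth: float = HBM_BANDWIDTH_BYTES_S) -> float:
    """Idealized lower bound: bytes moved divided by peak bandwidth."""
    return bytes_moved / bandwidth


# Reading the entire 288 GB of HBM once at the peak rate.
full_sweep_ms = min_stream_time_s(HBM_CAPACITY_BYTES) * 1e3
print(f"Full-memory sweep: at least {full_sweep_ms:.0f} ms")
```

Under these assumptions, a single pass over the full 288 GB takes at least 36 ms. Memory-bound steps cannot finish faster than this kind of bound, which is why bandwidth matters alongside capacity.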
Optimizations in ROCm 7.2.0
Performance does not come from hardware alone. The key software additions are:
- AOTriton gfx950 kernel support: native attention acceleration through ahead-of-time compiled Triton kernels for CDNA4.
- Occupancy tuning: reduced warp count for better efficiency.
- hipBLASLt GEMM optimization: tuned kernels for FP8, BF16, and FP16, the dominant data types in diffusion and transformer workloads.
- Pipelining and ThinLTO compiler optimizations.
What this means for the AI hardware market
The three scenarios (video, image, and 3D) span common generative ComfyUI use cases. That AMD leads in all three suggests the CUDA/ROCm gap, long a structural advantage for NVIDIA, is narrowing, although how far it narrows depends on the software maturity of each framework. For ComfyUI users, AMD is now a legitimate option, at least on the evidence of these vendor-published benchmarks.
Frequently Asked Questions
- What is the CDNA4 architecture?
- CDNA4 is AMD's latest data center GPU architecture, used in the MI355X. The MI355X provides 256 compute units, 288 GB of HBM3e memory, and 8 TB/s of memory bandwidth, a combination well suited to attention-heavy operations in transformer and diffusion models.
- What are AOTriton and hipBLASLt?
- AOTriton is AMD's library of ahead-of-time compiled Triton kernels; its native kernels for gfx950 (CDNA4) accelerate attention operations. hipBLASLt is AMD's GEMM library with tuned kernels for FP8, BF16, and FP16, the data types most common in modern diffusion and transformer models.
- Are the benchmarks reproducible?
- AMD published a Docker image with the optimizations pre-configured, so anyone with access to the same hardware can rerun the configuration and compare results. The per-workflow runtimes (Wan2.2: 116.91 s vs. 168.28 s, FLUX.1-dev: 24.77 s vs. 35.09 s, Hunyuan3D: 21.51 s vs. 25.84 s) are listed in AMD's published table.