arXiv:2606.20543: Spatially Speculative Decoding accelerates image generation 13.3×

SSD (Spatially Speculative Decoding) is a new method for autoregressive image generation that predicts a token's horizontal and vertical neighbors at the same time. The authors report up to 13.3× speedup while keeping visual quality high on the DPG-Bench and GenEval benchmarks.

This article was generated using artificial intelligence from primary sources.

Autoregressive image generation gets a 2D superpower

On June 19, 2026, authors Shilong Xiang, Zirui Zhang, Lijun Yu, and Chengzhi Mao published arXiv:2606.20543, introducing Spatially Speculative Decoding (SSD) — a method that challenges a core assumption of autoregressive visual models.

Classical autoregressive models generate images token by token, in a flat 1D sequence. SSD breaks with this approach: instead of one token at a time, it simultaneously predicts two spatially adjacent tokens — the horizontal neighbor and the token directly below. This exploits the two-dimensional structure of images and reduces the total number of decoding steps.
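To make the step-count argument concrete, here is a minimal sketch that compares raster-order decoding with an idealized schedule in which every already-generated token emits its right and lower neighbors in the same step. The function names and the assumption that every speculated token is accepted are illustrative and not taken from the paper, so the second number is a best-case bound rather than SSD's actual schedule.

```python
def raster_steps(height: int, width: int) -> int:
    """Classical decoding: one token per step, row by row."""
    return height * width


def wavefront_steps(height: int, width: int) -> int:
    """Steps needed when each known token emits its right and lower neighbor.

    Idealized assumption (not from the paper): every speculated token
    is accepted, so the grid fills one anti-diagonal per step.
    """
    known = {(0, 0)}      # the first token is generated in step 1
    frontier = [(0, 0)]   # tokens produced in the most recent step
    steps = 1
    while len(known) < height * width:
        new_tokens = set()
        for row, col in frontier:
            # Candidate neighbors: to the right and directly below.
            for r, c in ((row, col + 1), (row + 1, col)):
                if r < height and c < width and (r, c) not in known:
                    new_tokens.add((r, c))
        known |= new_tokens
        frontier = list(new_tokens)
        steps += 1
    return steps


if __name__ == "__main__":
    print(raster_steps(32, 32), wavefront_steps(32, 32))
```

For a 32×32 token grid this prints 1024 raster steps against 63 idealized steps. Any rejected speculation in a real system would raise the second number.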

Up to 13.3× speedup with visual quality kept high

Evaluation on the DPG-Bench and GenEval benchmarks shows up to 13.3× speedup in autoregressive image generation. Crucially, visual quality remains high: SSD does not trade image fidelity for speed, which is the usual cost of aggressive decoding optimizations.

Addressing the memory wall — the high-resolution bottleneck

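SSD directly targets the memory wall problem. Each serial decoding step has to stream the model weights from memory to produce a single token, and the number of tokens grows quadratically with image side length, so classical token-by-token decoding becomes a critical bottleneck at high resolution. The spatially speculative approach attacks this at the level of the decoding algorithm, by cutting the number of serial steps, rather than through implementation-level tuning alone.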
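The growth in token count can be checked with quick arithmetic. The sketch below assumes a square image and a hypothetical tokenizer in which each token covers a 16×16 pixel patch. The downsampling factor is an assumption for illustration, not a figure from the paper.

```python
def token_count(side_px: int, downsample: int = 16) -> int:
    """Tokens in a square image when each token covers downsample x downsample pixels.

    The default factor of 16 is an illustrative assumption.
    """
    if side_px % downsample != 0:
        raise ValueError("side_px must be a multiple of downsample")
    tokens_per_side = side_px // downsample
    return tokens_per_side ** 2


if __name__ == "__main__":
    for side in (256, 512, 1024):
        print(side, token_count(side))
```

Under these assumptions, doubling the side length quadruples the token count (256, 1024, and 4096 tokens for the three sizes), and with one token per step the number of serial decoding steps grows at the same rate.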

Comparison with existing approaches

Standard speculative decoding in NLP accelerates language models by having a cheap draft propose several tokens that the full model then verifies in a single pass. SSD instead exploits the 2D topology of images, a structure that text sequences do not have. The work also differs from compression methods such as quantization and pruning in that it does not alter model parameters, only the inference strategy.
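For readers unfamiliar with the 1D baseline, the following is a minimal sketch of one greedy draft-and-verify round as used in text speculative decoding. It is a generic illustration, not SSD's algorithm: `speculative_step`, `toy_target`, and `toy_draft` are hypothetical names, and the toy models are plain functions standing in for neural networks.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int = 4) -> List[int]:
    """Return the tokens accepted in one draft-and-verify round (greedy variant)."""
    # 1. The cheap draft model proposes k tokens one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target model checks each position. A real system runs these
    #    checks as one batched forward pass; here they are a plain loop.
    accepted: List[int] = []
    for token in proposed:
        expected = target(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # substitute the target's token and stop
            return accepted
        accepted.append(token)

    # 3. Every proposal matched, so the target also contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(seq: List[int]) -> int:
    """Stand-in for the full model: emits 0, 1, 2, 3, 4, 0, ..."""
    return len(seq) % 5


def toy_draft(seq: List[int]) -> int:
    """Stand-in for the draft model: agrees with the target except at position 2."""
    return 0 if len(seq) == 2 else len(seq) % 5


if __name__ == "__main__":
    print(speculative_step([], toy_draft, toy_target, k=4))
    print(speculative_step([], toy_target, toy_target, k=4))
```

With the imperfect draft, the round yields three tokens (`[0, 1, 2]`) for one target pass. With a perfect draft it yields five (`[0, 1, 2, 3, 4]`). The output always matches what the target alone would have produced, which is why this family of methods changes speed without changing results.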

The paper was submitted on June 18 and published on June 19, 2026.

Frequently Asked Questions

What is Spatially Speculative Decoding and how does it differ from classical autoregressive decoding?
SSD predicts two tokens at once, the horizontal neighbor and the vertical neighbor, by exploiting the 2D spatial structure of the image rather than treating the image as a flat 1D sequence of tokens. This relaxes the serial bottleneck and substantially reduces the number of decoding steps.
On which benchmarks was SSD evaluated and what are the results?
The method was tested on the DPG-Bench and GenEval benchmarks, achieving up to 13.3× speedup in autoregressive image generation while maintaining high visual quality.
What problem does SSD solve in high-resolution image generation?
SSD directly addresses the memory wall bottleneck that arises during high-resolution autoregressive image generation, where the serial nature of classical methods becomes a critical constraint due to the enormous number of tokens.