AWS G7e Blackwell Instances: Qwen3-32B on SageMaker for $0.41 per Million Tokens — 4× Cheaper Inference
Why it matters
AWS G7e instances are new SageMaker GPU instances built on the NVIDIA RTX PRO 6000 Blackwell chip with 96 GB of GDDR7 memory, delivering up to 2.3× better inference performance than G6e. For Qwen3-32B, the cost drops from $2.06 to $0.79 per million output tokens, and to $0.41 with EAGLE speculative decoding.
What does AWS bring with G7e instances?
AWS announced G7e on April 20, 2026: a new generation of GPU instances for Amazon SageMaker AI. The instances use the NVIDIA RTX PRO 6000 Blackwell Server Edition with 96 GB of GDDR7 memory, twice the memory of the previous G6e generation and on a newer, faster memory standard.
The goal of G7e is clear: cheaper and faster inference of large language models on SageMaker, AWS's standard platform for enterprise ML deployment.
How much faster are they?
AWS benchmarks show up to 2.3× better inference performance compared to G6e for generative models. The concrete example AWS provides is the Qwen3-32B model:
- G6e — $2.06 per million output tokens
- G7e — $0.79 per million output tokens
- G7e + EAGLE speculative decoding — $0.41 per million output tokens
With EAGLE (a technique in which a lightweight draft model proposes several tokens ahead and the target model verifies them in one pass), the price drops 4× compared to G6e with the same technique. For production systems generating billions of tokens per month, this is the difference between thousands and tens of thousands of dollars: at 10 billion output tokens per month, that is roughly $20,600 on G6e versus $4,100 on G7e with EAGLE.
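To make the mechanism concrete, here is a toy sketch of the generic draft-and-verify loop that speculative decoding is built on (EAGLE refines it with a lightweight draft head attached to the target model). The two stub functions below stand in for the draft and target models; they are deterministic toys, not a real implementation.

```python
# Toy sketch of speculative decoding's draft-and-verify loop.
# Both "models" are deterministic stubs over integer token IDs.

def draft_next(context: list[int]) -> int:
    # Cheap stand-in for the draft model's next-token prediction.
    return (context[-1] * 7 + 3) % 100

def target_next(context: list[int]) -> int:
    # Stand-in for the expensive target model; mostly agrees with the
    # draft, but diverges when the last token is a multiple of 5.
    if context[-1] % 5 == 0:
        return (context[-1] + 1) % 100
    return (context[-1] * 7 + 3) % 100

def speculative_generate(prompt: list[int], n_new: int, k: int = 4) -> list[int]:
    """Generate n_new tokens, drafting k at a time and verifying them."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) The draft model proposes k tokens autoregressively (cheap).
        ctx = list(tokens)
        draft = []
        for _ in range(k):
            ctx.append(draft_next(ctx))
            draft.append(ctx[-1])
        # 2) The target model checks the proposals in order; in a real
        #    system this verification is a single batched forward pass.
        ctx = list(tokens)
        for t in draft:
            if target_next(ctx) != t:
                break
            ctx.append(t)
            tokens.append(t)
        # 3) The target's own next token is always appended, so every
        #    pass makes progress even if all drafts are rejected.
        tokens.append(target_next(tokens))
    return tokens[: len(prompt) + n_new]

print(speculative_generate([1], n_new=12))
```

The economics come from step 2: when the draft is usually right, the target model emits several tokens for the price of one verification pass.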
Which model sizes are supported?
G7e comes in several configurations:
- g7e.2xlarge — 1 GPU, $4.20/h, supports models up to 35B parameters (e.g., Qwen3-32B, Llama-3.1-8B)
- 2 GPU variant — for models up to ~70B parameters
- 4 GPU variant — for even larger models
- 8 GPU variant — up to 300B parameters, for the largest open-source models
The lineup covers the full range, from small production models to the upper end of what can be self-hosted today.
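As a rough sketch of what deployment looks like, the snippet below spins up a single-GPU endpoint with the SageMaker Python SDK. The instance type string "ml.g7e.2xlarge", the container URI placeholder, and the environment variable are assumptions for illustration; consult the AWS documentation for the exact values in your region.

```python
# Sketch: deploying Qwen3-32B to a 1-GPU G7e endpoint with the
# SageMaker Python SDK. The instance type name ("ml.g7e.2xlarge"),
# the container URI, and the env var are illustrative assumptions.
import sagemaker
from sagemaker import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

model = Model(
    image_uri="<lmi-container-uri-for-your-region>",  # placeholder
    env={
        "HF_MODEL_ID": "Qwen/Qwen3-32B",  # weights pulled at startup
    },
    role=role,
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g7e.2xlarge",  # assumed name of the 1-GPU variant
)
print(predictor.endpoint_name)
```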
What does 1,600 Gbps EFA networking mean?
For multi-GPU and multi-node deployments, networking between instances is critical. G7e supports EFA (Elastic Fabric Adapter) networking at up to 1,600 Gbps. EFA is AWS's network interface that bypasses the operating system's TCP/IP stack so that GPUs on different instances can exchange data directly, essential for distributed inference where a model is split across multiple devices.
In practice, this means a 300B model can be served across 8 GPUs without the network becoming the bottleneck that dominates latency, which was previously a problem on weaker instance types.
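As an illustration of that kind of split, here is a minimal tensor-parallel serving sketch using vLLM. Within a single instance the eight shards communicate over the local interconnect; EFA becomes the transport when a deployment spans multiple instances. The checkpoint name is a placeholder, since any model that fits in 8 × 96 GB could stand in.

```python
# Sketch: sharding one model across all 8 GPUs of an instance with
# vLLM tensor parallelism. The checkpoint name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",   # placeholder; swap in your own checkpoint
    tensor_parallel_size=8,   # split weights across 8 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain EFA networking in one paragraph."], params)
print(outputs[0].outputs[0].text)
```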
Implications for the inference market
G7e changes the economics of self-hosted LLM inference. Until now, it has typically been cheaper to use dedicated inference providers like Groq, Together, or Fireworks than to self-host a model on AWS. At $0.41 per million tokens, AWS approaches those prices while offering the advantages of full model control, fine-tuning, and data privacy.
For enterprise customers who already have AWS contracts and compliance requirements, G7e becomes a serious alternative for production inference. It also puts pressure on competing inference providers — if AWS can offer a similar price with simple SageMaker integration, differentiation must come from another dimension (latency, SLA, additional features).
This article was generated using artificial intelligence from primary sources.