AWS G7e Blackwell Instances: Qwen3-32B on SageMaker for $0.41 per Million Tokens — 4× Cheaper Inference
Why it matters
AWS G7e instances are new SageMaker GPU instances built on the NVIDIA RTX PRO 6000 Blackwell chip with 96 GB of GDDR7 memory, delivering up to 2.3× better inference performance than G6e. For Qwen3-32B, the cost drops from $2.06 to $0.79 per million output tokens, and to $0.41 with EAGLE speculative decoding.
What does AWS bring with G7e instances?
AWS announced G7e on April 20, 2026: a new generation of GPU instances for Amazon SageMaker AI. The instances use the NVIDIA RTX PRO 6000 Blackwell Server Edition with 96 GB of GDDR7 memory, twice the memory of the previous G6e generation and on a newer, faster memory standard.
The goal of G7e is clear: cheaper and faster inference of large language models on SageMaker, AWS's standard platform for enterprise ML deployment.
How much faster are they?
AWS benchmarks show up to 2.3× better inference performance compared to G6e for generative models. The concrete example AWS provides is the Qwen3-32B model:
- G6e — $2.06 per million output tokens
- G7e — $0.79 per million output tokens
- G7e + EAGLE speculative decoding — $0.41 per million output tokens
With EAGLE (a technique in which a lightweight draft model proposes several tokens ahead and the target model verifies them in one pass), the price drops 4× compared to G6e with the same technique. For production systems generating billions of tokens per month, this is the difference between thousands and tens of thousands of dollars: at 10 billion output tokens per month, that is roughly $20,600 on G6e versus $4,100 on G7e with EAGLE.
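To make the mechanism concrete, here is a toy sketch of the generic draft-and-verify loop that speculative decoding is built on (EAGLE refines it with a lightweight draft head attached to the target model). The two stub functions below stand in for the draft and target models; they are deterministic toys, not a real implementation.

```python
# Toy sketch of speculative decoding's draft-and-verify loop.
# Both "models" are deterministic stubs over integer token IDs.

def draft_next(context: list[int]) -> int:
    # Cheap stand-in for the draft model's next-token prediction.
    return (context[-1] * 7 + 3) % 100

def target_next(context: list[int]) -> int:
    # Stand-in for the expensive target model; mostly agrees with the
    # draft, but diverges when the last token is a multiple of 5.
    if context[-1] % 5 == 0:
        return (context[-1] + 1) % 100
    return (context[-1] * 7 + 3) % 100

def speculative_generate(prompt: list[int], n_new: int, k: int = 4) -> list[int]:
    """Generate n_new tokens, drafting k at a time and verifying them."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n_new:
        # 1) The draft model proposes k tokens autoregressively (cheap).
        ctx = list(tokens)
        draft = []
        for _ in range(k):
            ctx.append(draft_next(ctx))
            draft.append(ctx[-1])
        # 2) The target model checks the proposals in order; in a real
        #    system this verification is a single batched forward pass.
        ctx = list(tokens)
        for t in draft:
            if target_next(ctx) != t:
                break
            ctx.append(t)
            tokens.append(t)
        # 3) The target's own next token is always appended, so every
        #    pass makes progress even if all drafts are rejected.
        tokens.append(target_next(tokens))
    return tokens[: len(prompt) + n_new]

print(speculative_generate([1], n_new=12))
```

The economics come from step 2: when the draft is usually right, the target model emits several tokens for the price of one verification pass.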
Which model sizes are supported?
G7e comes in several configurations:
- g7e.2xlarge — 1 GPU, $4.20/h, supports models up to 35B parameters (e.g., Qwen3-32B, Llama-3.1-8B)
- 2 GPU variant — for models up to ~70B parameters
- 4 GPU variant — for even larger models
- 8 GPU variant — up to 300B parameters, for the largest open-source models
The lineup covers the full range, from small production models to the upper end of what can be self-hosted today.
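As a rough sketch of what deployment looks like, the snippet below spins up a single-GPU endpoint with the SageMaker Python SDK. The instance type string "ml.g7e.2xlarge", the container URI placeholder, and the environment variable are assumptions for illustration; consult the AWS documentation for the exact values in your region.

```python
# Sketch: deploying Qwen3-32B to a 1-GPU G7e endpoint with the
# SageMaker Python SDK. The instance type name ("ml.g7e.2xlarge"),
# the container URI, and the env var are illustrative assumptions.
import sagemaker
from sagemaker import Model

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

model = Model(
    image_uri="<lmi-container-uri-for-your-region>",  # placeholder
    env={
        "HF_MODEL_ID": "Qwen/Qwen3-32B",  # weights pulled at startup
    },
    role=role,
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g7e.2xlarge",  # assumed name of the 1-GPU variant
)
print(predictor.endpoint_name)
```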
What does 1,600 Gbps EFA networking mean?
For multi-GPU and multi-node deployments, networking between instances is critical. G7e supports EFA (Elastic Fabric Adapter) networking at up to 1,600 Gbps. EFA is AWS's network interface that bypasses the operating system's TCP/IP stack so that GPUs on different instances can exchange data directly, essential for distributed inference where a model is split across multiple devices.
In practice, this means a 300B model can be served across 8 GPUs without the network becoming the bottleneck that dominates latency, which was previously a problem on weaker instance types.
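As an illustration of that kind of split, here is a minimal tensor-parallel serving sketch using vLLM. Within a single instance the eight shards communicate over the local interconnect; EFA becomes the transport when a deployment spans multiple instances. The checkpoint name is a placeholder, since any model that fits in 8 × 96 GB could stand in.

```python
# Sketch: sharding one model across all 8 GPUs of an instance with
# vLLM tensor parallelism. The checkpoint name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",   # placeholder; swap in your own checkpoint
    tensor_parallel_size=8,   # split weights across 8 GPUs
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain EFA networking in one paragraph."], params)
print(outputs[0].outputs[0].text)
```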
Implications for the inference market
G7e changes the economics of self-hosted LLM inference. Until now, it has typically been cheaper to use dedicated inference providers like Groq, Together, or Fireworks than to self-host a model on AWS. At $0.41 per million tokens, AWS approaches those prices while offering the advantages of full model control, fine-tuning, and data privacy.
For enterprise customers who already have AWS contracts and compliance requirements, G7e becomes a serious alternative for production inference. It also puts pressure on competing inference providers — if AWS can offer a similar price with simple SageMaker integration, differentiation must come from another dimension (latency, SLA, additional features).
This article was generated using artificial intelligence from primary sources.