AWS SageMaker automatically benchmarks generative AI models and provides optimal inference configurations
Why it matters
Amazon SageMaker AI now automatically benchmarks generative AI models across different GPU configurations using the NVIDIA AIPerf tool, eliminating weeks of manual testing and providing recommendations ranked by cost, latency, or throughput.
The end of weeks of manual testing
Amazon SageMaker AI now includes a feature that automatically benchmarks generative AI models across different GPU configurations. Instead of teams manually testing combinations of H100, A100, L4, and other GPU types with different batch sizes and optimizations, SageMaker now produces a validated list of deployment configurations in hours.
The feature uses the NVIDIA AIPerf tool under the hood. AIPerf generates synthetic queries that simulate real workloads, measures latency from the first token to the last, and calculates request throughput per second. SageMaker runs the tests in parallel across multiple configurations and collects the results into a single comparison table.
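A harness like AIPerf has to capture time-to-first-token and end-to-end latency from a streaming response. A minimal sketch of that measurement, using a simulated token stream (the `fake_stream` generator and all timings are illustrative stand-ins, not part of AIPerf or SageMaker):

```python
import time
from typing import Iterable, Iterator


def measure_stream(tokens: Iterable[str]) -> dict:
    """Time a token stream: time-to-first-token, total latency, token count."""
    start = time.perf_counter()
    ttft = None
    count = 0
    for _ in tokens:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token arrived
        count += 1
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "tokens": count}


def fake_stream(n: int = 20, delay: float = 0.005) -> Iterator[str]:
    """Stand-in for a model's streaming response."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"


metrics = measure_stream(fake_stream())
print(metrics["tokens"], round(metrics["ttft_s"], 3), round(metrics["total_s"], 3))
```

Throughput then falls out of the same data: requests completed divided by wall-clock time, aggregated across the parallel clients.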
Three ranking criteria
Results can be ranked by three different criteria, depending on business priority. The first criterion is total cost per inference call, calculated by combining the instance price with the average response generation time; this matters most for applications with high query volume.
The second criterion is latency: interactive applications such as chatbots require a fast time-to-first-token and consistent generation speed. The third criterion is maximum throughput, meaning how many parallel requests a configuration can serve before performance degrades, which is most useful for applications that process queries or data in batches.
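The cost-per-call arithmetic behind the first criterion is simple enough to sketch directly. The hourly prices, latencies, and concurrency figures below are made-up placeholders, not SageMaker pricing:

```python
def cost_per_call(hourly_price_usd: float, avg_seconds_per_call: float,
                  concurrent_requests: int = 1) -> float:
    """Cost of one call on a dedicated instance: price divided by calls served per hour."""
    calls_per_hour = 3600 / avg_seconds_per_call * concurrent_requests
    return hourly_price_usd / calls_per_hour


# Illustrative comparison: a large GPU serving many parallel requests can come out
# cheaper per call than a small GPU serving only a couple at a time.
big = cost_per_call(hourly_price_usd=32.77, avg_seconds_per_call=0.8, concurrent_requests=64)
small = cost_per_call(hourly_price_usd=1.00, avg_seconds_per_call=2.5, concurrent_requests=2)
print(f"big GPU: ${big:.6f}/call, small GPU: ${small:.6f}/call")
```

This is why ranking by raw instance price alone misleads: the denominator (calls actually served per hour) varies enormously across configurations.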
Practical benefit for MLOps teams
The typical production deployment process looks like this: the team selects a model, makes an initial hardware estimate, runs a load test, discovers performance issues, changes the configuration, and repeats the test. This loop can run for weeks. The new SageMaker feature eliminates these iterations by covering most relevant configurations in a single pass.
An important detail is that SageMaker does not return just one “best” configuration, but an entire ranked list. Teams can review the cost-versus-latency trade-off and make informed decisions. For example, a configuration that is 20 percent cheaper but 30 percent slower may be acceptable for certain applications.
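Reviewing that trade-off over a ranked list amounts to comparing relative deltas between neighboring configurations. The configuration names and numbers here are invented for illustration:

```python
from dataclasses import dataclass


@dataclass
class Config:
    name: str
    cost_per_call_usd: float
    p50_latency_ms: float


# Hypothetical benchmark output, already ranked by cost per call.
ranked = [
    Config("L4-batch8",   0.00080, 910),
    Config("A100-batch4", 0.00100, 700),
    Config("H100-batch2", 0.00160, 420),
]

cheap, runner_up = ranked[0], ranked[1]
cost_saving = 1 - cheap.cost_per_call_usd / runner_up.cost_per_call_usd
latency_penalty = cheap.p50_latency_ms / runner_up.p50_latency_ms - 1
print(f"{cheap.name} is {cost_saving:.0%} cheaper but "
      f"{latency_penalty:.0%} slower than {runner_up.name}")
```

With the invented numbers above, the cheapest configuration is 20 percent cheaper but 30 percent slower, exactly the kind of trade-off a team would weigh against its latency budget.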
Integration with the existing workflow
The feature is integrated into the existing SageMaker AI workflow. The user passes the model and constraints — for example “maximum cost $0.01 per call” or “latency below 500ms” — and SageMaker returns configurations that meet the criteria. Results include endpoint configurations ready for direct deployment.
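Constraint filtering of that kind can be sketched as a predicate applied over the benchmark results. The `Config` shape, field names, and thresholds are assumptions for illustration; the actual SageMaker API will differ:

```python
from typing import NamedTuple, Optional


class Config(NamedTuple):
    name: str
    cost_per_call_usd: float
    p50_latency_ms: float


def meets_constraints(cfg: Config,
                      max_cost_usd: Optional[float] = None,
                      max_latency_ms: Optional[float] = None) -> bool:
    """True if the configuration satisfies every constraint that was given."""
    if max_cost_usd is not None and cfg.cost_per_call_usd > max_cost_usd:
        return False
    if max_latency_ms is not None and cfg.p50_latency_ms > max_latency_ms:
        return False
    return True


candidates = [
    Config("L4-batch8",   0.00080, 910),
    Config("H100-batch2", 0.00160, 420),
]
# keep configurations under $0.01 per call and under 500 ms latency
ok = [c for c in candidates if meets_constraints(c, max_cost_usd=0.01, max_latency_ms=500)]
print([c.name for c in ok])  # → ['H100-batch2']
```

Unset constraints are simply skipped, so a team can filter on cost alone, latency alone, or both.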
This is a concrete automation of MLOps decisions that previously required an experienced engineer with deep knowledge of GPU architectures. For companies without such specialists, the feature democratizes access to optimal deployment configurations.
Related news
AWS: multimodal biological foundation models accelerate drug discovery by 50 percent and diagnostics by 90 percent
CNCF: infrastructure engineer migrated 60+ Kubernetes resources in 30 minutes with the help of an AI agent
GitHub Copilot Chat: new features for understanding pull requests and automated code reviews