🤖 24 AI
🟡 🏥 In Practice Friday, April 17, 2026 · 3 min read

AWS Nova Micro for Text-to-SQL: fine-tuning + serverless Bedrock for $0.80 per month

Why it matters

AWS demonstrated how LoRA fine-tuning of the Amazon Nova Micro model, combined with serverless Bedrock on-demand inference, can handle 22,000 SQL queries per month for just $0.80. Training costs $8 through Bedrock Customization or $65 through SageMaker. The approach eliminates continuous model-hosting costs and is well suited to variable production workloads.

Amazon Web Services published a detailed case study on April 16, 2026, on building a text-to-SQL system using Nova Micro with LoRA fine-tuning and Bedrock on-demand inference. Authors Zeek Granston and Felipe Lopez present two parallel implementations — one through Amazon Bedrock Customization and one through SageMaker AI — and provide a clear cost breakdown for each approach.

Why LoRA + serverless?

The traditional self-hosted approach for custom SQL generation requires constant infrastructure — GPU instances running 24/7 regardless of usage. For internal BI tools where SQL is generated occasionally, this is a massive waste.

Low-Rank Adaptation (LoRA) enables fine-tuning of only a small additional parameter layer on top of the base model. When combined with serverless inference, you pay only per token — no fixed costs when the system is idle. AWS describes this approach as “custom text-to-SQL without the cost of continuous model hosting.”
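In practice, "pay only per token" means each request goes through a plain Bedrock runtime call against the fine-tuned model. A minimal sketch, assuming the Converse API and a placeholder custom-model ARN (the real ARN comes from your own deployment; prompt format, token limits, and helper names here are illustrative, not taken from the post):

```python
# Sketch of invoking a fine-tuned text-to-SQL model via Bedrock's Converse API.
# MODEL_ID is a placeholder; a real call needs AWS credentials and the ARN of
# your deployed custom model.
MODEL_ID = "arn:aws:bedrock:us-east-1:123456789012:custom-model-deployment/example"  # placeholder

def build_request(question: str, schema: str) -> dict:
    """Assemble the Converse payload: table schema as context plus the question."""
    prompt = f"Schema:\n{schema}\n\nQuestion: {question}\nSQL:"
    return {
        "modelId": MODEL_ID,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.0},
    }

def generate_sql(question: str, schema: str) -> str:
    import boto3  # imported lazily so the payload builder works offline
    client = boto3.client("bedrock-runtime")
    resp = client.converse(**build_request(question, schema))
    return resp["output"]["message"]["content"][0]["text"]
```

When no requests arrive, nothing runs and nothing is billed, which is the whole point of the serverless side of the design.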

Concrete costs

The post delivers a precise economic analysis that is rare in AWS technical materials:

One-time training cost:

  • Bedrock Customization: $8.00 (2,000 examples, 5 epochs)
  • SageMaker AI: $65.15 (4-hour job on ml.g5.48xlarge)

Monthly production workload of 22,000 queries:

  • Input cost: $0.616
  • Output cost: $0.184
  • Total monthly: $0.80

The difference is dramatic compared to any self-hosting scenario, where a GPU instance alone would cost several hundred dollars per month regardless of query volume.

Technical hyperparameters

The authors share the concrete configuration that worked through Bedrock:

  • Number of epochs: 5
  • Learning rate: 0.00001
  • Warmup steps: 10
  • Training duration: 2–3 hours

Training data came from the public sql-create-context dataset of more than 78,000 natural-language/SQL query pairs (the Bedrock run fine-tuned on a 2,000-example subset of it). Training and validation loss curves decrease steadily and converge, an indicator of stable fine-tuning without overfitting.
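Wired into Bedrock's customization API, that configuration looks roughly like the sketch below. The hyperparameter key names follow Bedrock's string-valued convention, but the exact keys for Nova Micro, the model identifier, and the role/S3 values are assumptions here, not quoted from the post:

```python
# Sketch of submitting the fine-tuning job through Bedrock Customization with
# the post's hyperparameters. Key names and the base-model identifier are
# assumptions; role ARN and S3 URIs are caller-supplied placeholders.
HYPERPARAMETERS = {
    "epochCount": "5",                 # 5 epochs, as in the post
    "learningRate": "0.00001",         # 1e-5
    "learningRateWarmupSteps": "10",   # 10 warmup steps
}

def submit_customization_job(job_name: str, role_arn: str,
                             train_s3: str, out_s3: str):
    import boto3  # imported lazily so the config above is testable offline
    bedrock = boto3.client("bedrock")
    return bedrock.create_model_customization_job(
        jobName=job_name,
        customModelName=f"{job_name}-model",
        roleArn=role_arn,
        baseModelIdentifier="amazon.nova-micro-v1:0",  # assumed identifier
        customizationType="FINE_TUNING",
        trainingDataConfig={"s3Uri": train_s3},
        outputDataConfig={"s3Uri": out_s3},
        hyperParameters=HYPERPARAMETERS,
    )
```

Bedrock accepts hyperparameters as strings, which is why the numeric values are quoted.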

The latency cost

There is no free lunch. The LoRA adapter adds overhead during inference:

  • Cold-start TTFT (time-to-first-token): 639 ms (34% above base model)
  • Warm-start TTFT: 380 ms (7% above)
  • Token generation rate: ~183 tokens/second (27% below base model)
  • End-to-end response: ~477 ms

AWS describes this latency as “still very suitable for interactive applications” — a description that warrants careful interpretation. For a user interface where SQL is generated as the user types, an extra ~30 percent latency is acceptable. For a batch process generating hundreds of queries at once, the cumulative overhead can be significant.
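The batch concern can be made concrete with the post's warm-start numbers. A rough sketch, assuming a 60-token SQL response (an assumption; the post's ~477 ms end-to-end figure suggests its responses were shorter) and back-computing the base-model token rate from the reported 27% slowdown:

```python
# Rough per-query latency with vs. without the LoRA adapter, using the post's
# warm-start figures. Response length and the implied base rate are assumptions.
TTFT_WARM_S = 0.380                        # warm-start time to first token (adapter)
ADAPTER_TOK_S = 183.0                      # tokens/second with the adapter
BASE_TOK_S = ADAPTER_TOK_S / (1 - 0.27)    # ~250 tok/s implied base rate
BASE_TTFT_S = TTFT_WARM_S / 1.07           # adapter warm TTFT is 7% above base
OUT_TOKENS = 60                            # assumed SQL response length

adapter_latency = TTFT_WARM_S + OUT_TOKENS / ADAPTER_TOK_S
base_latency = BASE_TTFT_S + OUT_TOKENS / BASE_TOK_S

print(f"adapter: {adapter_latency * 1000:.0f} ms per query")
print(f"base:    {base_latency * 1000:.0f} ms per query")
# For a batch of 500 sequential queries, the overhead accumulates:
print(f"batch of 500: +{(adapter_latency - base_latency) * 500:.0f} s total")
```

For an interactive UI the per-query difference is invisible; for a sequential batch job it adds up to the better part of a minute per few hundred queries, which is the trade-off the post is gesturing at.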

When to use this approach

AWS explicitly targets variable workloads where cost is a priority over absolute speed. Typical scenarios include internal BI tools in enterprises, chat assistants for legacy databases, and analytics tools used occasionally rather than continuously. For systems with high and predictable volume, dedicated hosting remains more economical.

🤖

This article was generated using artificial intelligence from primary sources.