🤖 24 AI
🟡 🏥 In Practice Friday, April 17, 2026 · 3 min read

AWS Nova Micro for Text-to-SQL: fine-tuning + serverless Bedrock for $0.80 per month

Why it matters

AWS demonstrated how LoRA fine-tuning of the Amazon Nova Micro model, combined with serverless Bedrock on-demand inference, can handle 22,000 SQL queries per month for just $0.80. Training costs $8 through Bedrock Customization or $65 through SageMaker. The approach eliminates continuous model-hosting costs and is well suited to variable production workloads.

Amazon Web Services published a detailed case study on April 16, 2026, on building a text-to-SQL system using Nova Micro with LoRA fine-tuning and Bedrock on-demand inference. Authors Zeek Granston and Felipe Lopez present two parallel implementations — one through Amazon Bedrock Customization and one through SageMaker AI — and provide a clear cost breakdown for each approach.

Why LoRA + serverless?

The traditional self-hosted approach for custom SQL generation requires constant infrastructure — GPU instances running 24/7 regardless of usage. For internal BI tools where SQL is generated occasionally, this is a massive waste.

Low-Rank Adaptation (LoRA) enables fine-tuning of only a small additional parameter layer on top of the base model. When combined with serverless inference, you pay only per token — no fixed costs when the system is idle. AWS describes this approach as “custom text-to-SQL without the cost of continuous model hosting.”
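In practice, "pay only per token" means each request goes through a plain Bedrock runtime call against the fine-tuned model. A minimal sketch, assuming the Converse API and a placeholder custom-model ARN (the real ARN comes from your own deployment; prompt format, token limits, and helper names here are illustrative, not taken from the post):

```python
# Sketch of invoking a fine-tuned text-to-SQL model via Bedrock's Converse API.
# MODEL_ID is a placeholder; a real call needs AWS credentials and the ARN of
# your deployed custom model.
MODEL_ID = "arn:aws:bedrock:us-east-1:123456789012:custom-model-deployment/example"  # placeholder

def build_request(question: str, schema: str) -> dict:
    """Assemble the Converse payload: table schema as context plus the question."""
    prompt = f"Schema:\n{schema}\n\nQuestion: {question}\nSQL:"
    return {
        "modelId": MODEL_ID,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.0},
    }

def generate_sql(question: str, schema: str) -> str:
    import boto3  # imported lazily so the payload builder works offline
    client = boto3.client("bedrock-runtime")
    resp = client.converse(**build_request(question, schema))
    return resp["output"]["message"]["content"][0]["text"]
```

When no requests arrive, nothing runs and nothing is billed, which is the whole point of the serverless side of the design.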

Concrete costs

The post delivers a precise economic analysis that is rare in AWS technical materials:

One-time training cost:

  • Bedrock Customization: $8.00 (2,000 examples, 5 epochs)
  • SageMaker AI: $65.15 (4-hour job on ml.g5.48xlarge)

Monthly production workload of 22,000 queries:

  • Input cost: $0.616
  • Output cost: $0.184
  • Total monthly: $0.80

The difference is dramatic compared to any self-hosting scenario, where a GPU instance alone would cost several hundred dollars per month regardless of query volume.

Technical hyperparameters

The authors share the concrete configuration that worked through Bedrock:

  • Number of epochs: 5
  • Learning rate: 0.00001
  • Warmup steps: 10
  • Training duration: 2–3 hours

Training data came from the public sql-create-context dataset of more than 78,000 natural-language/SQL query pairs (the Bedrock run fine-tuned on a 2,000-example subset of it). Training and validation loss curves decrease steadily and converge, an indicator of stable fine-tuning without overfitting.
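Wired into Bedrock's customization API, that configuration looks roughly like the sketch below. The hyperparameter key names follow Bedrock's string-valued convention, but the exact keys for Nova Micro, the model identifier, and the role/S3 values are assumptions here, not quoted from the post:

```python
# Sketch of submitting the fine-tuning job through Bedrock Customization with
# the post's hyperparameters. Key names and the base-model identifier are
# assumptions; role ARN and S3 URIs are caller-supplied placeholders.
HYPERPARAMETERS = {
    "epochCount": "5",                 # 5 epochs, as in the post
    "learningRate": "0.00001",         # 1e-5
    "learningRateWarmupSteps": "10",   # 10 warmup steps
}

def submit_customization_job(job_name: str, role_arn: str,
                             train_s3: str, out_s3: str):
    import boto3  # imported lazily so the config above is testable offline
    bedrock = boto3.client("bedrock")
    return bedrock.create_model_customization_job(
        jobName=job_name,
        customModelName=f"{job_name}-model",
        roleArn=role_arn,
        baseModelIdentifier="amazon.nova-micro-v1:0",  # assumed identifier
        customizationType="FINE_TUNING",
        trainingDataConfig={"s3Uri": train_s3},
        outputDataConfig={"s3Uri": out_s3},
        hyperParameters=HYPERPARAMETERS,
    )
```

Bedrock accepts hyperparameters as strings, which is why the numeric values are quoted.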

The latency cost

There is no free lunch. The LoRA adapter adds overhead during inference:

  • Cold-start TTFT (time-to-first-token): 639 ms (34% above base model)
  • Warm-start TTFT: 380 ms (7% above)
  • Token generation rate: ~183 tokens/second (27% below base model)
  • End-to-end response: ~477 ms

AWS describes this latency as “still very suitable for interactive applications” — a description that warrants careful interpretation. For a user interface where SQL is generated as the user types, an extra ~30 percent latency is acceptable. For a batch process generating hundreds of queries at once, the cumulative overhead can be significant.
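The batch concern can be made concrete with the post's warm-start numbers. A rough sketch, assuming a 60-token SQL response (an assumption; the post's ~477 ms end-to-end figure suggests its responses were shorter) and back-computing the base-model token rate from the reported 27% slowdown:

```python
# Rough per-query latency with vs. without the LoRA adapter, using the post's
# warm-start figures. Response length and the implied base rate are assumptions.
TTFT_WARM_S = 0.380                        # warm-start time to first token (adapter)
ADAPTER_TOK_S = 183.0                      # tokens/second with the adapter
BASE_TOK_S = ADAPTER_TOK_S / (1 - 0.27)    # ~250 tok/s implied base rate
BASE_TTFT_S = TTFT_WARM_S / 1.07           # adapter warm TTFT is 7% above base
OUT_TOKENS = 60                            # assumed SQL response length

adapter_latency = TTFT_WARM_S + OUT_TOKENS / ADAPTER_TOK_S
base_latency = BASE_TTFT_S + OUT_TOKENS / BASE_TOK_S

print(f"adapter: {adapter_latency * 1000:.0f} ms per query")
print(f"base:    {base_latency * 1000:.0f} ms per query")
# For a batch of 500 sequential queries, the overhead accumulates:
print(f"batch of 500: +{(adapter_latency - base_latency) * 500:.0f} s total")
```

For an interactive UI the per-query difference is invisible; for a sequential batch job it adds up to the better part of a minute per few hundred queries, which is the trade-off the post is gesturing at.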

When to use this approach

AWS explicitly targets variable workloads where cost is a priority over absolute speed. Typical scenarios include internal BI tools in enterprises, chat assistants for legacy databases, and analytics tools used occasionally rather than continuously. For systems with high and predictable volume, dedicated hosting remains more economical.

🤖

This article was generated using artificial intelligence from primary sources.