AWS Nova Micro for Text-to-SQL: fine-tuning + serverless Bedrock for $0.80 per month
Why it matters
AWS demonstrated how LoRA fine-tuning of the Amazon Nova Micro model combined with serverless Bedrock on-demand inference can handle 22,000 SQL queries per month for just $0.80. Training costs $8 through Bedrock Customization or $65 through SageMaker. The approach eliminates the cost of continuous model hosting and is well suited to variable production workloads.
Amazon Web Services published a detailed case study on April 16, 2026, on building a text-to-SQL system using Nova Micro with LoRA fine-tuning and Bedrock on-demand inference. Authors Zeek Granston and Felipe Lopez present two parallel implementations — one through Amazon Bedrock Customization and one through SageMaker AI — and provide a clear cost breakdown for each approach.
Why LoRA + serverless?
The traditional self-hosted approach for custom SQL generation requires constant infrastructure — GPU instances running 24/7 regardless of usage. For internal BI tools where SQL is generated occasionally, this is a massive waste.
Low-Rank Adaptation (LoRA) enables fine-tuning of only a small additional parameter layer on top of the base model. When combined with serverless inference, you pay only per token — no fixed costs when the system is idle. AWS describes this approach as “custom text-to-SQL without the cost of continuous model hosting.”
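The pay-per-token model boils down to a plain API call with no infrastructure to keep warm. The sketch below builds a request for the Bedrock Converse API; the model ARN and prompt template are illustrative assumptions, not values from the AWS post.

```python
# Sketch of a pay-per-token text-to-SQL call against a fine-tuned model
# via the Bedrock Converse API. The model ARN below is a placeholder.

def build_sql_request(question: str, schema: str) -> dict:
    """Build a Converse-API request body for SQL generation."""
    prompt = (
        f"Given the schema:\n{schema}\n"
        f"Write a SQL query that answers: {question}"
    )
    return {
        # Hypothetical ARN of the deployed custom model.
        "modelId": "arn:aws:bedrock:us-east-1:123456789012:custom-model-deployment/example",
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": 256, "temperature": 0.0},
    }

request = build_sql_request(
    "How many employees joined after 2020?",
    "CREATE TABLE employees (id INT, name TEXT, join_year INT)",
)

# With AWS credentials configured, the actual call would look like:
# import boto3
# client = boto3.client("bedrock-runtime")
# response = client.converse(**request)
# sql = response["output"]["message"]["content"][0]["text"]
```

When no requests arrive, nothing runs and nothing is billed, which is the whole point of the serverless half of the design.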
Concrete costs
The post delivers a precise economic analysis that is rare in AWS technical materials:
One-time training cost:
- Bedrock Customization: $8.00 (2,000 examples, 5 epochs)
- SageMaker AI: $65.15 (4-hour job on ml.g5.48xlarge)
Monthly production workload of 22,000 queries:
- Input cost: $0.616
- Output cost: $0.184
- Total monthly: $0.80
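The published figures can be reconstructed from Nova Micro's standard on-demand rates ($0.035 per million input tokens, $0.14 per million output tokens) if you assume roughly 800 input tokens and 60 output tokens per query. Those per-query token counts are our assumption, not numbers from the post, but they reproduce the totals exactly:

```python
# Back-of-envelope reconstruction of the article's monthly cost,
# assuming standard Nova Micro on-demand pricing and per-query token
# counts (~800 in, ~60 out) that make the published figures come out.

PRICE_IN = 0.035 / 1_000_000   # USD per input token
PRICE_OUT = 0.14 / 1_000_000   # USD per output token

queries = 22_000
in_tokens, out_tokens = 800, 60  # assumed per-query averages

input_cost = queries * in_tokens * PRICE_IN    # 0.616
output_cost = queries * out_tokens * PRICE_OUT # 0.1848 (~$0.184 in the post)

print(f"total: ${input_cost + output_cost:.2f}")  # total: $0.80
```

The arithmetic also makes the breakeven intuition concrete: at these rates, even a ten-fold jump to 220,000 queries per month would cost about $8, still far below a single always-on GPU instance.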
The difference is dramatic compared to any self-hosting scenario, where a GPU instance alone would cost several hundred dollars per month regardless of query volume.
Technical hyperparameters
The authors share the concrete configuration that worked through Bedrock:
- Number of epochs: 5
- Learning rate: 0.00001
- Warmup steps: 10
- Training duration: 2–3 hours
Training data came from the public sql-create-context dataset with more than 78,000 natural language and SQL query pairs. Training and validation loss curves consistently decrease and converge — an indicator of stable fine-tuning without overfitting.
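Bedrock customization jobs consume training data as JSONL. A minimal sketch of converting sql-create-context records (fields: question, context, answer) into that shape is below; the "bedrock-conversation-2024" schema and the system prompt are assumptions based on the Nova customization documentation, so verify against the current docs before running a job.

```python
import json

# Sketch: turn a sql-create-context record into one Bedrock fine-tuning
# JSONL line. Schema version and system prompt are assumptions.

def to_training_line(record: dict) -> str:
    example = {
        "schemaVersion": "bedrock-conversation-2024",
        "system": [{"text": "Translate the question into a SQL query "
                            "for the given schema."}],
        "messages": [
            {"role": "user",
             "content": [{"text": f"{record['context']}\n{record['question']}"}]},
            {"role": "assistant",
             "content": [{"text": record["answer"]}]},
        ],
    }
    return json.dumps(example)

sample = {
    "question": "How many heads of departments are older than 56?",
    "context": "CREATE TABLE head (age INTEGER)",
    "answer": "SELECT COUNT(*) FROM head WHERE age > 56",
}
print(to_training_line(sample))
```

One line per example; 2,000 such lines, uploaded to S3 and referenced in the customization job, match the training run described in the post.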
The latency cost
There is no free lunch. The LoRA adapter adds overhead during inference:
- Cold-start TTFT (time-to-first-token): 639 ms (34% above base model)
- Warm-start TTFT: 380 ms (7% above)
- Token generation rate: ~183 tokens/second (27% below base model)
- End-to-end response: ~477 ms
AWS describes this latency as “still very suitable for interactive applications” — a description that warrants careful interpretation. For a user interface where SQL is generated as the user types, an extra ~30 percent latency is acceptable. For a batch process generating hundreds of queries at once, the cumulative overhead can be significant.
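The batch-overhead concern can be quantified from the reported numbers. Assuming ~60 generated tokens per query (our assumption) and back-computing the base model's figures from the stated percentages:

```python
# Rough per-query overhead of the LoRA adapter, derived from the
# article's warm-start numbers. Output length is an assumption.

OUT_TOKENS = 60

ttft_lora, rate_lora = 0.380, 183.0   # seconds, tokens/second
ttft_base = ttft_lora / 1.07          # warm TTFT is 7% above base
rate_base = rate_lora / (1 - 0.27)    # generation rate is 27% below base

per_query_lora = ttft_lora + OUT_TOKENS / rate_lora
per_query_base = ttft_base + OUT_TOKENS / rate_base
overhead = per_query_lora - per_query_base

N = 1_000  # sequential queries in a batch
print(f"per-query overhead: {overhead * 1000:.0f} ms")
print(f"extra time for {N} sequential queries: {overhead * N:.0f} s")
```

Roughly a tenth of a second per query is invisible in a chat UI but adds minutes to a large sequential batch, which is exactly the distinction the "interactive applications" framing glosses over.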
When to use this approach
AWS explicitly targets variable workloads where cost is a priority over absolute speed. Typical scenarios include internal BI tools in enterprises, chat assistants for legacy databases, and analytics tools used occasionally rather than continuously. For systems with high and predictable volume, dedicated hosting remains more economical.
This article was generated using artificial intelligence from primary sources.