Text-to-SQL Benchmark Study: 4 KB Semantic Layer Adds 17-23 Percentage Points of Accuracy, Model Choice Within Tier Makes No Significant Difference
An arXiv preprint by Rumiantsau and Fokeev (April 28, 2026) tests three frontier LLMs (Claude Opus 4.7, Sonnet 4.6, GPT-5.4) on 100 text-to-SQL questions over the Cleaned Contoso retail dataset in ClickHouse. Without a semantic layer, the models achieve 45.5-50.5% accuracy; with a 4 KB markdown semantic document, 67.7-68.7%. Within each setup, the models are statistically indistinguishable.
Michael Rumiantsau and Ivan Fokeev published the arXiv preprint “Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models” on April 28, 2026. The study poses a simple but powerful question: for text-to-SQL accuracy, how much depends on the choice of model and how much on the context (the semantic layer) the model is given?
Experimental Setup
The authors test three frontier models: Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.4. The benchmark consists of 100 natural language questions translated into SQL queries over the Cleaned Contoso Retail Dataset hosted in ClickHouse. Each model runs two rounds: once without a semantic layer, and once with a 4 KB markdown document that describes “measures, conventions, and disambiguation rules” for the dataset.
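The article does not reproduce the authors' prompt template, so the following is only a minimal sketch of how such a paired setup could be assembled: the same question and schema are sent twice, and the only difference between the two rounds is whether the semantic document is included. The function name `build_prompt`, the table definition, and the contents of the semantic document are all hypothetical and chosen for illustration.

```python
from typing import Optional

# Hypothetical schema and semantic document; neither is taken from the study.
SCHEMA_DDL = (
    "CREATE TABLE sales (order_date Date, quantity UInt32, "
    "net_price Float64, currency_code String) "
    "ENGINE = MergeTree ORDER BY order_date"
)
SEMANTIC_DOC = (
    "- revenue = sum(quantity * net_price)\n"
    "- 'sales' means revenue unless the question asks for units"
)


def build_prompt(question: str, schema_ddl: str,
                 semantic_doc: Optional[str] = None) -> str:
    """Assemble a text-to-SQL prompt.

    The semantic document is added only when provided, so the two rounds
    of a paired run differ in exactly one section.
    """
    parts = [
        "You write ClickHouse SQL. Return a single query and nothing else.",
        "## Schema\n" + schema_ddl,
    ]
    if semantic_doc is not None:
        parts.append("## Semantic layer\n" + semantic_doc)
    parts.append("## Question\n" + question)
    return "\n\n".join(parts)


question = "What were total sales in 2023?"
baseline_prompt = build_prompt(question, SCHEMA_DDL)
layered_prompt = build_prompt(question, SCHEMA_DDL, SEMANTIC_DOC)
```

Keeping everything else in the prompt identical is what makes the comparison paired: any difference in accuracy between the two rounds can be attributed to the semantic document rather than to wording changes elsewhere.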
Results
The results are surprisingly clean:
- Without semantic layer: 45.5%–50.5% accuracy across all three models
- With semantic layer: 67.7%–68.7% accuracy across all three models
- Improvement: +17 to +23 percentage points
Within each of the two setups, the models are statistically indistinguishable. In other words, Opus 4.7 is not significantly better than Sonnet 4.6 or GPT-5.4 when all three are given the same context.
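The article does not say which statistical test the authors used. As an assumption, one common choice for paired correct/incorrect outcomes on the same questions is the exact McNemar test, which looks only at questions where the two conditions disagree. The sketch below uses synthetic per-question outcomes, not the study's data, and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(a_correct: Sequence[bool], b_correct: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired binary outcomes."""
    # Count discordant pairs: questions where exactly one condition is correct.
    only_a = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    only_b = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # no disagreements, so no evidence of a difference
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Synthetic outcomes for 100 questions (illustrative only, not from the paper).
model_a = [True] * 50 + [False] * 50
model_b = [True] * 47 + [False] * 3 + [True] * 4 + [False] * 46
without_layer = [True] * 48 + [False] * 52
with_layer = [True] * 68 + [False] * 32

p_between_models = mcnemar_exact(model_a, model_b)
p_layer_effect = mcnemar_exact(without_layer, with_layer)
```

In this synthetic example the two models disagree on only a handful of questions in roughly equal numbers, so the test finds no significant difference, while the with/without comparison has many disagreements all in one direction and yields a very small p-value.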
The Main Takeaway
In the authors’ words: “The presence of the semantic-layer document accounts for essentially all of the significant variance; model choice within tier does not.”
For enterprise practice, the message is clear: switching to a different frontier model within the same tier will not substitute for better dataset documentation. A 4 KB markdown file with metric definitions, naming conventions, and homonym disambiguation rules delivers 17-23 percentage points, far more than the spread between the three models within either setup (at most 5 points without the semantic layer and 1 point with it).
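The comparison between the semantic-layer gain and the within-tier spread can be checked from the reported ranges alone. The article gives only the minimum and maximum accuracy per setup, not per-model figures, so the values below are bounds on any single model's gain rather than exact per-model numbers.

```python
# Accuracy ranges (min, max) across the three models, as reported in the article.
without_layer = (45.5, 50.5)
with_layer = (67.7, 68.7)

# Bounds on the per-model gain from adding the semantic layer.
min_gain = round(with_layer[0] - without_layer[1], 1)  # weakest "with" vs strongest "without"
max_gain = round(with_layer[1] - without_layer[0], 1)  # strongest "with" vs weakest "without"

# Spread between models inside each setup.
spread_without = round(without_layer[1] - without_layer[0], 1)
spread_with = round(with_layer[1] - with_layer[0], 1)

print(f"gain from semantic layer: {min_gain} to {max_gain} points")
print(f"spread between models: {spread_without} (without), {spread_with} (with)")
```

Even the smallest possible gain from the semantic layer is more than three times the largest gap between models in either setup.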
Frequently Asked Questions
- What is a semantic layer in the context of text-to-SQL?
- A hand-authored markdown document (4 KB in this study) that describes 'measures, conventions, and disambiguation rules' for the dataset. It defines what individual columns mean, how metrics are calculated, and how to resolve homonyms.
- What is the main takeaway of the study?
- The presence of the semantic-layer document accounts for “essentially all of the significant variance” in results. Model choice within the same tier (Opus 4.7 vs Sonnet 4.6 vs GPT-5.4) does not yield a statistically significant difference.
- What were the models tested on?
- 100 questions over the Cleaned Contoso Retail Dataset in ClickHouse. Each model was tested with the same 100 questions in two variants: without the semantic layer document and with a 4 KB markdown semantic layer.
This article was generated using artificial intelligence from primary sources.