Semantic layer adds 17-23 percentage points of accuracy on a text-to-SQL benchmark

An arXiv preprint by Rumiantsau and Fokeev (April 28, 2026) tests three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 text-to-SQL questions over the Cleaned Contoso Retail Dataset in ClickHouse. Without a semantic layer, the models reach 45.5-50.5% accuracy; with a 4 KB markdown semantic document, they reach 67.7-68.7%. Within each condition, the models are statistically indistinguishable.

Michael Rumiantsau and Ivan Fokeev published the arXiv preprint “Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models” on April 28, 2026. The study poses a simple but powerful question: for text-to-SQL accuracy, how much does the choice of model matter compared with the context (the semantic layer) the model is given?

Experimental Setup

The authors test three frontier models: Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.4. The benchmark consists of 100 natural language questions translated into SQL queries over the Cleaned Contoso Retail Dataset hosted in ClickHouse. Each model runs two rounds: once without a semantic layer, and once with a 4 KB markdown document that describes “measures, conventions, and disambiguation rules” for the dataset.
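The two rounds differ only in whether the semantic document is part of the prompt. A minimal sketch of that setup follows; the prompt wording, the `build_prompt` helper, and the table and column names are illustrative assumptions, not the authors' actual harness.

```python
from typing import Optional


def build_prompt(question: str, schema_ddl: str,
                 semantic_doc: Optional[str] = None) -> str:
    """Assemble a text-to-SQL prompt (hypothetical format).

    The semantic document is included only when one is passed,
    which models the "with semantic layer" round.
    """
    parts = [
        "You write ClickHouse SQL. Answer with a single SQL query.",
        "## Schema\n" + schema_ddl,
    ]
    if semantic_doc is not None:
        parts.append("## Semantic layer\n" + semantic_doc)
    parts.append("## Question\n" + question)
    return "\n\n".join(parts)


# Hypothetical inputs, for illustration only.
schema = "CREATE TABLE sales (order_date Date, quantity UInt32, net_price Float64)"
question = "What was total revenue in 2023?"
semantic = "- revenue = sum(quantity * net_price)"

baseline_prompt = build_prompt(question, schema)           # round 1: no semantic layer
layered_prompt = build_prompt(question, schema, semantic)  # round 2: with semantic layer
```

Keeping everything else in the prompt identical is what makes the comparison paired: any accuracy difference on a given question can be attributed to the document rather than to prompt wording.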

Results

The results are surprisingly clean:

Without semantic layer: 45.5%–50.5% accuracy across all three models
With semantic layer: 67.7%–68.7% accuracy across all three models
Improvement: +17 to +23 percentage points
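The improvement range is consistent with the two accuracy ranges above. The article does not list per-model figures, so the sketch below only derives the smallest and largest gain that those ranges allow; it is not a reconstruction of the per-model results.

```python
# Accuracy ranges (percent) as reported in the article: (lowest, highest).
without_layer = (45.5, 50.5)
with_layer = (67.7, 68.7)

# Smallest possible gain: weakest "with" result minus strongest "without" result.
smallest_gain = round(with_layer[0] - without_layer[1], 1)
# Largest possible gain: strongest "with" result minus weakest "without" result.
largest_gain = round(with_layer[1] - without_layer[0], 1)

print(f"gain between +{smallest_gain} and +{largest_gain} percentage points")
```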

Within each of the two setups, models are statistically indistinguishable. In other words, Opus 4.7 is not significantly better than Sonnet 4.6 or GPT-5.4 if they use the same context.
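Because every model answers the same 100 questions, outcomes can be compared pairwise per question. The article does not say which statistical test the authors used; one common choice for paired correct/incorrect outcomes is an exact McNemar test, sketched below with the standard library only. The outcome vectors are synthetic and chosen purely to illustrate the two situations.

```python
from math import comb


def mcnemar_exact(a_correct: list[bool], b_correct: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes."""
    if len(a_correct) != len(b_correct):
        raise ValueError("both runs must cover the same questions")
    # Only discordant pairs (one run right, the other wrong) carry information.
    only_a = sum(a and not b for a, b in zip(a_correct, b_correct))
    only_b = sum(b and not a for a, b in zip(a_correct, b_correct))
    n = only_a + only_b
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Synthetic outcomes over 100 questions (NOT the study's data).
model_a = [i < 68 for i in range(100)]          # correct on questions 0..67
model_b = [1 <= i < 69 for i in range(100)]     # correct on questions 1..68
no_layer = [i < 46 for i in range(100)]         # correct on questions 0..45

p_between_models = mcnemar_exact(model_a, model_b)   # two discordant pairs
p_layer_effect = mcnemar_exact(model_a, no_layer)    # 22 discordant pairs, all one way
```

With these synthetic vectors the two models differ on only two questions, so the test finds no evidence of a difference, while the 22 one-sided discordant pairs in the second comparison give a very small p-value.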

The Main Takeaway

In the authors’ words: “The presence of the semantic-layer document accounts for essentially all of the significant variance; model choice within tier does not.”

For enterprise practice, the message is clear: a better frontier model within the same tier will not replace better dataset documentation. A 4 KB markdown file with metric definitions, naming conventions, and homonym disambiguation rules delivers 17-23 percentage points, while swapping one tested model for another within the tier produced no statistically significant difference.

Frequently Asked Questions

What is a semantic layer in the context of text-to-SQL?

A hand-authored markdown document (4 KB in this study) that describes 'measures, conventions, and disambiguation rules' for the dataset. It defines what individual columns mean, how metrics are calculated, and how to resolve homonyms.
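To make the three ingredients concrete, here is a purely hypothetical fragment of such a document, held in a Python string with a size check. The metric definitions, column names, and rules are invented for illustration and are not taken from the study's actual document; treating 4 KB as 4 × 1024 bytes is also an assumption.

```python
# Hypothetical semantic-layer fragment; none of these definitions come from the study.
SEMANTIC_LAYER_MD = """\
# Measures
- revenue: sum(quantity * net_price)
- order_count: count of distinct order_key

# Conventions
- Dates are filtered on order_date unless the question mentions delivery.
- Monetary values are reported in USD.

# Disambiguation
- "customers" means customers with at least one order, not all registered rows.
- "price" refers to net_price (after discount), not unit_price.
"""

MAX_BYTES = 4 * 1024  # assumed reading of "4 KB"

size_bytes = len(SEMANTIC_LAYER_MD.encode("utf-8"))
fits_budget = size_bytes <= MAX_BYTES
print(f"{size_bytes} bytes, within budget: {fits_budget}")
```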

What is the main takeaway of the study?

The presence of the semantic-layer document accounts for “essentially all of the significant variance” in results. Model choice within the same tier (Opus 4.7 vs Sonnet 4.6 vs GPT-5.4) does not yield a statistically significant difference.

What were the models tested on?

100 questions over the Cleaned Contoso Retail Dataset in ClickHouse. Each model was tested with the same 100 questions in two variants: without the semantic layer document and with a 4 KB markdown semantic layer.
