arXiv:2605.22664: WorkstreamBench tests LLM agents on end-to-end spreadsheet tasks in finance — and frontier models fail

WorkstreamBench is a new benchmark from a 10-author team led by Thomson Yen that tests LLM agents on real Excel and spreadsheet tasks in the financial domain: invoices, reports, and cost analysis. GPT-4o, Claude, and Gemini models are compared, and none reliably completes the full task set, which points to structural shortcomings in current agentic infrastructure for enterprise finance.

This article was generated using artificial intelligence from primary sources.

The arXiv preprint WorkstreamBench, published on 22 May 2026, introduces the first benchmark that tests LLM agents on real end-to-end spreadsheet tasks in the financial domain. The ten-author team led by Thomson Yen designed tasks that match the daily practice of accountants and financial analysts: invoice processing, generating monthly reports, and cost analysis across multiple worksheets. The main finding: no frontier model reliably completes the full task set, even with access to an Excel API tool.

Why is a financial spreadsheet workflow hard for AI?

A surface-level look at Excel tasks might suggest that an LLM with tool access should handle such work trivially; GPT and Claude already post high scores on MMLU math and HumanEval programming. But a real spreadsheet workflow involves layers that MMLU-style benchmarks do not touch:

Structural complexity: a workflow often spans 10–50 cells with interdependent formulas. Changing one entry triggers a cascade of downstream results. The agent must understand the dependency graph, not just individual formulas.
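
To make the dependency-graph point concrete, here is a minimal sketch in Python. The toy sheet, its cell addresses, and the `downstream` helper are all hypothetical illustrations, not part of WorkstreamBench. The idea is simply that editing one input cell forces a recalculation of every cell reachable from it.

```python
from collections import defaultdict, deque

# Hypothetical toy sheet: each formula cell maps to the cells it reads.
precedents = {
    "C2": ["A2", "B2"],  # C2 = A2 * B2
    "C3": ["A3", "B3"],  # C3 = A3 * B3
    "C4": ["C2", "C3"],  # C4 = SUM(C2:C3)
    "D4": ["C4"],        # D4 = C4 * tax rate
}


def downstream(changed, precedents):
    """Return every cell that must be recalculated when `changed` is edited."""
    # Invert the mapping: for each cell, which formulas read it?
    dependents = defaultdict(set)
    for cell, inputs in precedents.items():
        for source in inputs:
            dependents[source].add(cell)

    # Breadth-first walk over the inverted graph.
    seen, queue = set(), deque([changed])
    while queue:
        current = queue.popleft()
        for nxt in dependents[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted(seen)


print(downstream("A2", precedents))  # ['C2', 'C4', 'D4']
```

An agent that edits A2 and then inspects only C2 has missed two of the three cells its change affects.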

Mixed formula styles: a real spreadsheet combines VLOOKUP, INDEX-MATCH, SUMPRODUCT, dynamic array formulas (FILTER, SORT, UNIQUE in modern Excel versions), pivot table references, and custom Named Ranges. The agent must understand the semantic role of each of these in the workflow.
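
As a rough illustration of how mixed a single sheet can be, the sketch below inventories the functions used in a few formulas. The formulas and the `functions_used` helper are made up for this example, and the regular expression is a deliberate simplification: it would also match function-like text inside string literals, which a real formula parser would not.

```python
import re
from collections import Counter

# Simplified pattern: an uppercase name directly followed by "(".
FUNCTION_CALL = re.compile(r"\b([A-Z][A-Z0-9.]*)\s*\(")

# Hypothetical formulas, one per cell.
formulas = {
    "B2": "=VLOOKUP(A2, Rates!A:B, 2, FALSE)",
    "B3": "=INDEX(Rates!B:B, MATCH(A3, Rates!A:A, 0))",
    "B4": "=SUMPRODUCT(C2:C10, D2:D10)",
    "B5": "=SORT(UNIQUE(FILTER(A2:A10, C2:C10>0)))",
}


def functions_used(formula):
    """List the function names in a formula, in order of appearance."""
    return FUNCTION_CALL.findall(formula)


inventory = Counter(
    name for formula in formulas.values() for name in functions_used(formula)
)
print(sorted(inventory))
```

Four cells already involve seven distinct functions, and knowing their names says nothing yet about what role each plays in the workflow.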

External validation: specific figures (tax rates, exchange rate tables, account codes) must match external references. An agent that generates a syntactically correct workflow but uses the wrong 2026 tax rates produces a result that looks reasonable but is business-incorrect.
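
A minimal sketch of such an external check, assuming a hypothetical reference table of tax rates: the rate values, category names, and invoice rows below are placeholders chosen for illustration, not real tax rates and not data from the paper.

```python
from decimal import Decimal

# Placeholder reference rates; a real check would load these from an
# authoritative external source.
REFERENCE_RATES = {
    "standard": Decimal("0.20"),
    "reduced": Decimal("0.10"),
}


def find_rate_mismatches(rows, reference):
    """Return (invoice, rate used, expected rate) for every row that disagrees."""
    mismatches = []
    for row in rows:
        expected = reference.get(row["category"])
        if expected is None or row["rate"] != expected:
            mismatches.append((row["invoice"], row["rate"], expected))
    return mismatches


rows = [
    {"invoice": "INV-001", "category": "standard", "rate": Decimal("0.20")},
    {"invoice": "INV-002", "category": "reduced", "rate": Decimal("0.20")},
]

mismatches = find_rate_mismatches(rows, REFERENCE_RATES)
print(mismatches)
```

INV-002 is arithmetically fine and would survive a formula check, yet it applies the wrong rate for its category, which is exactly the "looks reasonable but is business-incorrect" case.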

Conditional formatting as business logic: in real practice, conditional formatting expresses business rules (overdue invoices in red, approved transactions in green). The agent must understand that formatting is not decoration but a semantic layer.
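
One way to see formatting as a semantic layer is to write the rule as a function from data to fill colour. The sketch below uses the two rules named above; the field names, the sample invoices, and the choice that "approved" takes precedence over "overdue" are assumptions made for this example.

```python
from datetime import date


def fill_for(invoice, today):
    """Return the fill colour implied by the business rules, or None."""
    if invoice["approved"]:
        return "green"  # approved transactions
    if invoice["due"] < today:
        return "red"    # overdue and not yet approved
    return None         # no rule applies


today = date(2026, 5, 22)
invoices = [
    {"id": "INV-001", "due": date(2026, 4, 30), "approved": False},
    {"id": "INV-002", "due": date(2026, 4, 30), "approved": True},
    {"id": "INV-003", "due": date(2026, 6, 15), "approved": False},
]

fills = {inv["id"]: fill_for(inv, today) for inv in invoices}
print(fills)
```

An agent that rewrites the due dates but leaves the old colours in place has changed the data without changing what the sheet tells its reader.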

Which models were tested and what were the results?

The paper tests four frontier models in two environments: isolated (the model receives a CSV representation of the spreadsheet and writes a textual response) and agentic (the model has access to an Excel COM API or openpyxl tool and can execute operations).

Results in the agentic environment:

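| Model | Invoice | Report | Analysis | Total |
|---|---|---|---|---|
| GPT-4o | 58% | 47% | 41% | 49% |
| Claude Sonnet 4.6 | 54% | 51% | 43% | 49% |
| Claude Opus 4.7 | 63% | 56% | 52% | 57% |
| Gemini 3 Pro | 51% | 44% | 38% | 44% |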

Claude Opus 4.7 leads with a 57 percent aggregate score — but that means 43 percent of tasks produce an incorrect result. In finance, an incorrect result is not “close to correct” — it is an account that does not reconcile, an incorrectly billed amount, or a wrong report for a regulator.
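
As a quick arithmetic check, the Total column is consistent with a plain unweighted mean of the three category scores, rounded to the nearest percent. That is an assumption about how the aggregate is computed, not something stated here; the paper may weight categories by task count.

```python
# Category scores (Invoice, Report, Analysis) in percent, from the results above.
scores = {
    "GPT-4o": (58, 47, 41),
    "Claude Sonnet 4.6": (54, 51, 43),
    "Claude Opus 4.7": (63, 56, 52),
    "Gemini 3 Pro": (51, 44, 38),
}


def aggregate(parts):
    """Unweighted mean rounded to the nearest percent (assumed formula)."""
    return round(sum(parts) / len(parts))


totals = {model: aggregate(parts) for model, parts in scores.items()}
failure_rate = 100 - totals["Claude Opus 4.7"]

print(totals)
print(failure_rate)  # 43
```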

What are the concrete failure modes?

The authors document the four most common failure modes:

  1. Reference drift: the agent updates one cell but does not update all formulas that reference it. Result: summary amounts do not match detail figures.
  2. Format ignored: the agent generates the correct numeric value but does not apply the currency format or decimal precision that the workflow requires — producing a report that a business analyst rejects.
  3. Validation skip: the agent does not verify that the generated amounts match external source documents (e.g., a PDF invoice). Result: the spreadsheet state does not reflect reality.
  4. Schema break: the agent adds new columns but does not update the pivot table or dashboard that consumes the data — breaking downstream reports.
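
The first two failure modes lend themselves to simple mechanical checks. The sketch below is one hypothetical way to express them, not the benchmark's own scoring code: `reconciles` compares a summary cell against the detail rows it is supposed to total, and `has_precision` checks that an amount carries the expected number of decimal places. The amounts are invented.

```python
from decimal import Decimal


def reconciles(detail_amounts, summary_total):
    """True if the summary cell equals the sum of the detail rows."""
    return sum(detail_amounts, Decimal("0")) == summary_total


def has_precision(amount, places=2):
    """True if the amount is stored with exactly `places` decimal digits."""
    return amount.as_tuple().exponent == -places


# The agent appended a 19.99 line but left the old total in place.
details = [Decimal("120.00"), Decimal("80.50"), Decimal("19.99")]
stale_total = Decimal("200.50")
fresh_total = Decimal("220.49")

print(reconciles(details, stale_total))     # False: reference drift
print(reconciles(details, fresh_total))     # True
print(has_precision(Decimal("220.5")))      # False: wrong decimal precision
```

Using `Decimal` rather than floats keeps the comparison exact, which matters when the question is whether an account reconciles to the cent.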

What does this mean for SaaS products marketed as “AI for accountants”?

The implications for enterprise AI products are concrete. Products marketed as “automated invoice processing” or “AI bookkeeper” — including some top-tier SaaS products on the Croatian and European markets — most likely cannot reliably process an entire workflow without human review at every step. Marketing materials often suggest autonomous processing; the benchmark shows that the reality is still “AI suggests, human approves.”

The authors suggest two directions for improvement. The first is fine-tuning models on curated spreadsheet workflow datasets (labeled datasets of about 10,000 tasks, of the kind the benchmark uses, already exist). The second is integrating a formal validation layer that verifies semantic equivalence of old and new state before applying changes, which would prevent the reference drift and schema break failure modes.
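
A heavily simplified sketch of what a validation layer could look like follows. It does not implement the semantic-equivalence check the authors describe; it substitutes a simpler technique, running named invariants against a copy of the sheet and committing the change only if all of them hold. The dictionary-based sheet, the invariant, and the function names are hypothetical.

```python
import copy


def apply_with_validation(sheet, change, invariants):
    """Apply `change` to a copy; keep it only if every invariant still holds.

    Returns (resulting sheet, list of failed invariant names).
    """
    candidate = copy.deepcopy(sheet)
    change(candidate)
    failed = [name for name, check in invariants.items() if not check(candidate)]
    if failed:
        return sheet, failed  # reject: original state is untouched
    return candidate, []      # accept: new state replaces the old one


sheet = {"details": [100, 250], "total": 350}
invariants = {
    "total matches details": lambda s: s["total"] == sum(s["details"]),
}


def add_row_only(s):
    s["details"].append(50)  # forgets to update the total


def add_row_and_total(s):
    s["details"].append(50)
    s["total"] += 50


state, failed = apply_with_validation(sheet, add_row_only, invariants)
print(failed)          # ['total matches details']
state, failed = apply_with_validation(state, add_row_and_total, invariants)
print(state["total"])  # 400
```

Working on a copy means a rejected change never reaches the live sheet, which is the property that matters for reference drift and schema breaks.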

WorkstreamBench is public and available to researchers for reproduction and extension.

Frequently Asked Questions

Why is a financial spreadsheet workflow hard for AI?
Spreadsheet tasks in finance are not isolated Excel tricks. They involve end-to-end logic connecting 10–50 cells, formulas with VLOOKUP and INDEX-MATCH structures, validation against external sources, and conditional formatting that reflects business rules. An agent must understand both structure and semantics.

Which models are tested?
The paper tests GPT-4o, Claude Sonnet 4.6, Claude Opus 4.7, and Gemini 3 Pro in an isolated environment (no external tool) and in an agentic environment (with an Excel API tool). Performance is measured through formula correctness, end-state validity, and workflow completion rate.

What are the practical implications of the results?
For SaaS products marketed as "AI for accountants" (automated invoice processors, AI bookkeepers), the results show that reliable automation of real financial spreadsheet workflows is still out of reach without human review at every step.