How does SWE-chat differ from existing benchmarks?

Most existing coding-agent benchmarks, such as SWE-bench, rely on synthetic tasks or archived GitHub issues. SWE-chat is a dataset of real conversations that developers held with an agent in production, reflecting genuine queries, corrections, and feedback.

What is the dataset used for?

Researchers and teams developing coding agents can use the dataset to understand user expectations, identify where conversations typically break down, and evaluate improvements against realistic scenarios rather than synthetic tests.

What does this mean for the development of AI coding tools?

Realistic data on user and agent behavior in production enables targeted improvement of weak points, for example in error correction, requesting additional context, or deciding when to give up. That kind of improvement is harder to achieve when working only with synthetic benchmarks.

ArXiv SWE-chat: dataset of real interactions with coding agents

The problem with synthetic benchmarks

The last two years of AI coding agent development have largely relied on synthetic benchmarks like SWE-bench, HumanEval, and their variants. These benchmarks typically take historical GitHub issues or carefully crafted programming tasks and measure whether the agent can produce a solution that passes tests. The problem is that such tests don’t reflect how developers actually work with an agent: they don’t capture ambiguous instructions, partial context, intermediate conversation steps, or situations where the user changes their mind mid-task.

SWE-chat, a dataset recently published on arXiv, attempts to fill exactly that gap. The authors describe it as a collection of real in-the-wild interactions between users and AI coding agents in a production environment. Rather than carefully selected examples, the dataset contains natural conversations from developers using an autonomous system for their everyday tasks: fixing bugs, refactoring modules, writing tests, or asking for help with configuration.

What the dataset captures

According to the arXiv publication, SWE-chat provides insight into how developers actually use autonomous systems in practice. This includes typical query formulations, how users respond to the agent’s suggestions, how they react to incorrect or partially correct answers, and the moments where a conversation escalates into multi-step iteration. Such data is hard to reproduce under lab conditions because it requires real production usage and cooperative users who consent to having their conversations recorded for research purposes.

The dataset thus opens the door to analyses that were previously beyond the reach of the academic community. Researchers can observe how conversation quality changes over time, what strategies users develop with experience, when they give up on the agent and switch to manual work, and which types of tasks the agent reliably solves versus where it consistently fails. For teams developing their own agents, SWE-chat becomes a realistic test substrate for regression evaluations of new versions.
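As a rough illustration of the kind of analysis described above, the following Python sketch computes how often users abandon the agent for each task type. The record layout (`task_type`, `turns`, `outcome`) and the sample values are assumptions made up for this example; the article does not describe the actual SWE-chat schema, which may look quite different.

```python
from collections import defaultdict

# Hypothetical records invented for illustration only.
# The field names are assumptions, not the real SWE-chat schema.
SAMPLE_CONVERSATIONS = [
    {"task_type": "bug_fix", "turns": 4, "outcome": "resolved"},
    {"task_type": "bug_fix", "turns": 9, "outcome": "abandoned"},
    {"task_type": "refactor", "turns": 6, "outcome": "abandoned"},
    {"task_type": "refactor", "turns": 3, "outcome": "abandoned"},
    {"task_type": "write_tests", "turns": 2, "outcome": "resolved"},
]


def abandonment_rate_by_task(conversations):
    """Return the fraction of abandoned conversations per task type."""
    totals = defaultdict(int)
    abandoned = defaultdict(int)
    for conv in conversations:
        totals[conv["task_type"]] += 1
        if conv["outcome"] == "abandoned":
            abandoned[conv["task_type"]] += 1
    # Every key in totals has a count of at least 1, so the division is safe.
    return {task: abandoned[task] / totals[task] for task in totals}


rates = abandonment_rate_by_task(SAMPLE_CONVERSATIONS)
for task, rate in sorted(rates.items()):
    print(f"{task}: {rate:.0%} abandoned")
```

Running the same function on conversations collected with two agent versions would give a simple before-and-after comparison, which is one possible shape for the regression evaluations mentioned above.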

Implications for agent development and evaluation

The most important implication of the SWE-chat dataset is a shift from synthetic evaluation toward ecological validity. While synthetic benchmarks measure whether the agent is technically capable of solving a problem, SWE-chat measures whether it can solve it under the conditions in which the system is actually used: with incomplete information, changing instructions, and human feedback. That is closer to a real measure of usefulness than what synthetic benchmarks provide.

For the AI coding tools developer community, the dataset is invaluable because it enables targeted improvement of weak points. If SWE-chat analysis shows that agents consistently fail at requesting additional context from users, that becomes a clear development priority. If it turns out users most often give up when the agent misunderstands the intent of a task, teams can invest in better instruction understanding. Instead of guiding development by numbers on synthetic tests that don’t reflect reality, it becomes possible to guide it by real data on user and agent behavior in production.
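The prioritization described above could be sketched as follows, assuming that abandoned conversations have already been annotated with a failure-mode label. The labels, the `failure_mode` field, and the sample data are hypothetical and serve only to show the idea; the article does not say that SWE-chat ships with such annotations.

```python
from collections import Counter

# Hypothetical annotations of abandoned conversations.
# Both the labels and the field names are assumptions for this sketch.
ABANDONED = [
    {"id": "c1", "failure_mode": "misread_intent"},
    {"id": "c2", "failure_mode": "no_clarifying_question"},
    {"id": "c3", "failure_mode": "misread_intent"},
    {"id": "c4", "failure_mode": "bad_error_recovery"},
    {"id": "c5", "failure_mode": "misread_intent"},
    {"id": "c6", "failure_mode": "no_clarifying_question"},
]


def rank_failure_modes(conversations):
    """Return (failure_mode, share) pairs, most frequent first."""
    counts = Counter(conv["failure_mode"] for conv in conversations)
    total = sum(counts.values())
    return [(mode, n / total) for mode, n in counts.most_common()]


ranking = rank_failure_modes(ABANDONED)
for mode, share in ranking:
    print(f"{mode}: {share:.0%} of abandoned conversations")
```

In this made-up sample, misunderstood intent tops the list, which would point a team toward better instruction understanding before other fixes.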
