Google Research: ConvApparel dataset measures the 'realism gap' between AI user simulators and real people
Why it matters
Google Research has released ConvApparel — a new dataset of over 4,000 multi-turn conversations in an apparel shopping context, designed to measure how realistic LLM-based user simulators really are. The study shows that SFT and ICL approaches significantly outperform simple prompting and demonstrate 'remarkable out-of-distribution generalization'.
ConvApparel pairs the dataset with an evaluation framework that addresses an often-ignored problem in AI development: LLM-based user simulators behave unconvincingly. AI agents trained exclusively on conversations with these artificial "users" fail once they finally have to face real people.
What the problem is
A chatbot or AI agent in training needs someone to talk to. Training against real users is expensive and slow, so the standard practice is to use an LLM as a simulated user: another LLM plays the role of the end user and converses with the agent inside the training loop. But these simulated "users" exhibit traits that real humans rarely have: excessively detailed responses, perfectly consistent personas, unlimited patience, and encyclopedic knowledge. The result is an agent that performs excellently in testing but can collapse the moment someone from the real internet shows up.
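To make the setup concrete, here is a minimal sketch of such a simulated-user training loop. Everything in it is illustrative: `call_llm` is a placeholder for whatever chat-completion backend is used, and the shopper persona prompt and turn limit are assumptions, not the ConvApparel configuration.

```python
# Minimal sketch of the standard practice described above: one LLM plays the
# end user while the agent under development talks to it and generates
# training transcripts. All names and prompts here are illustrative.

def call_llm(system_prompt: str, history: list[dict]) -> str:
    """Placeholder for any chat-completion call (hosted or local model)."""
    raise NotImplementedError("wire this up to your model client")

# Hypothetical persona prompt for the simulated shopper.
USER_SIM_PROMPT = (
    "You are a shopper looking for a rain jacket under $80. "
    "Answer the assistant's questions and decide when you are satisfied."
)

def simulate_episode(agent_prompt: str, max_turns: int = 8) -> list[dict]:
    """Run one simulated conversation and return the transcript."""
    history: list[dict] = []
    for _ in range(max_turns):
        # The simulated user speaks, conditioned on the conversation so far.
        user_msg = call_llm(USER_SIM_PROMPT, history)
        history.append({"role": "user", "content": user_msg})
        # The agent under training replies; these transcripts feed the
        # training loop in place of expensive real-user data.
        agent_msg = call_llm(agent_prompt, history)
        history.append({"role": "assistant", "content": agent_msg})
    return history
```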
How they measured it
The dataset contains over 4,000 multi-turn conversations in an apparel shopping scenario. They were collected with a dual-agent protocol in which participants unknowingly chatted with either a "Good" (helpful) or a "Bad" (unhelpful) agent, which produced natural variation from satisfaction to frustration. The evaluation framework applies three checks: population-level statistical alignment, human-likeness scoring (a trained discriminator that tries to recognize synthetic conversations), and counterfactual validation, which asks whether simulators trained only on "good"-agent data can react realistically to frustrating "bad" agents.
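The human-likeness check is the easiest to illustrate. The paper only says a discriminator is trained to spot synthetic transcripts; the lightweight TF-IDF plus logistic-regression classifier below is one possible, assumed instantiation, with each conversation flattened into a single string.

```python
# Hedged sketch of the human-likeness check: train a discriminator to tell
# real transcripts from simulator transcripts. The architecture below is an
# illustrative assumption, not the setup used in the ConvApparel paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def conversation_to_text(turns: list[dict]) -> str:
    """Flatten a conversation into one string of 'role: content' segments."""
    return " ".join(f"{t['role']}: {t['content']}" for t in turns)

def human_likeness_auc(real_convs: list[list[dict]],
                       sim_convs: list[list[dict]]) -> float:
    texts = [conversation_to_text(c) for c in real_convs + sim_convs]
    labels = [1] * len(real_convs) + [0] * len(sim_convs)  # 1 = real human
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=0
    )
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(x_train), y_train)
    scores = clf.predict_proba(vectorizer.transform(x_test))[:, 1]
    # AUC near 0.5: the simulator is hard to tell apart from real shoppers.
    # AUC near 1.0: the simulator is trivially detectable as synthetic.
    return roc_auc_score(y_test, scores)
```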
Results and what remains
Discriminators reliably flagged simulated conversations as synthetic, confirming that the realism gap is real. Data-driven simulators, built with in-context learning (ICL) or supervised fine-tuning (SFT), significantly outperformed plain prompting on statistical alignment. Most interesting of all, the SFT and ICL simulators showed "remarkable out-of-distribution generalization": they adapted successfully to frustrating agents they had never seen during training.
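The difference between plain prompting and the data-driven ICL variant largely comes down to seeding the simulator's prompt with real transcripts. The sketch below illustrates that idea under assumptions of its own: the instruction wording, exemplar count, and sampling scheme are not taken from the paper.

```python
# Sketch of an ICL-style user simulator: instead of fine-tuning, the prompt
# is seeded with a few real conversations so the model imitates how actual
# shoppers respond. All prompt text here is an illustrative assumption.

import random

def build_icl_prompt(real_convs: list[list[dict]], n_examples: int = 3) -> str:
    """Assemble a few-shot prompt from randomly sampled real transcripts."""
    exemplars = random.sample(real_convs, n_examples)
    blocks = []
    for conv in exemplars:
        lines = [f"{turn['role'].upper()}: {turn['content']}" for turn in conv]
        blocks.append("\n".join(lines))
    examples = "\n\n---\n\n".join(blocks)
    return (
        "You are simulating a real apparel shopper. Match the tone, brevity, "
        "and occasional impatience of the example shoppers below.\n\n"
        f"{examples}\n\n"
        "Continue a new conversation in the same style, replying only as USER."
    )
```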
Open question: what is the minimum level of realism needed for an agent trained on a simulator to work in production? Google is calling for future real-world validation studies.