AWS ToolSimulator: LLM-Powered AI Agent Testing Without Live API Calls — Shared State Across Multi-Turn Conversations
Why it matters
AWS introduced ToolSimulator, an LLM-powered framework within the Strands Evals platform for safely testing AI agents without executing live API calls. The simulator maintains consistent shared state across multi-turn conversations and generates contextually appropriate responses, enabling testing of agents that send emails or modify databases without real consequences.
What did AWS announce?
On April 20, 2026, AWS introduced ToolSimulator — a new framework within the Strands Evals platform designed for safe and scalable testing of AI agents without executing live API calls. The goal is to solve one of the most painful problems in building production agents: how to test an agent that sends emails, modifies databases, or books flights without causing real consequences.
Why are classic mocks insufficient?
Developers have used mocks for decades — fake versions of external systems that return pre-defined responses. The problem with agents is that they conduct dynamic, multi-turn conversations in which system state evolves. A classic mock is static: it returns the same response every time. It cannot say “you created this user in step 1, now you can update them in step 5.”
The result: mocks are either too thin (missing realism) or too expensive to maintain (every test scenario requires hand-coding a state machine).
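The limitation described above can be seen in a few lines of plain Python. This is an illustrative sketch using the standard library's unittest.mock, not code from ToolSimulator: a static mock has no memory, so a user "created" earlier never appears in later responses.

```python
from unittest.mock import Mock

# A classic static mock: list_users always returns the same canned response,
# no matter what earlier calls in the conversation created.
api = Mock()
api.list_users.return_value = []   # pre-defined response, fixed forever

api.create_user(name="Ana")        # step 1: the "user" is created...
users = api.list_users()           # step 5: ...but the mock has no memory of it
print(users)                       # → [] — Ana is missing, so the test misleads
```

To make the mock track state, the developer would have to hand-code that behavior for every scenario, which is exactly the maintenance cost the article points to.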
How does ToolSimulator solve the problem?
ToolSimulator uses an LLM under the hood to generate tool responses on the fly. The key innovation is shared state — the simulator remembers everything that happened in the conversation and ensures future responses are consistent with history.
Example: the agent calls create_user(name="Ana") in step 2. In step 7 it calls list_users() — ToolSimulator knows Ana must be in the list because she was previously created. Without an LLM, the developer would have to manually code that state; with one, the simulator does it automatically.
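For contrast, here is roughly what "manually coding that state" looks like: a hand-written stateful fake that a developer would otherwise have to maintain for each scenario. The class and its fields are invented for illustration; ToolSimulator's point is that an LLM produces this consistency automatically.

```python
# Hand-coded stateful fake: the manual state machine ToolSimulator automates.
class HandCodedUserFake:
    def __init__(self):
        self._users = []                 # shared state, maintained by hand

    def create_user(self, name: str) -> dict:
        user = {"id": len(self._users) + 1, "name": name}
        self._users.append(user)         # step 2 mutates the state...
        return user

    def list_users(self) -> list[dict]:
        return list(self._users)         # ...so step 7 stays consistent with it

fake = HandCodedUserFake()
fake.create_user(name="Ana")             # step 2
print(fake.list_users())                 # step 7: [{'id': 1, 'name': 'Ana'}]
```

Every tool and every state transition needs code like this; with a dozen tools and branching conversations, the fake quickly becomes larger than the agent under test.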
What does the integration look like?
The technical integration is declarative and straightforward:
- @simulator.tool() decorator — the developer marks a Python function as a tool available to the agent. The simulator automatically captures the signature and docstring.
- Pydantic models — used for schema enforcement. What does this mean? Pydantic checks that arguments and return values have the correct types — if the agent sends a string instead of a number, the test fails at that point, before the LLM generates a response.
The developer writes a description of the tool, not the implementation — ToolSimulator covers the rest.
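The pattern might look roughly like the sketch below. The article shows only the decorator name, so the simulator object here is a minimal stand-in written for illustration, not the real ToolSimulator API; the Pydantic part uses the real pydantic library to show the type-rejection behavior described above.

```python
# Illustrative sketch only: "_StandInSimulator" mimics the described behavior
# (capturing signature and docstring); the real ToolSimulator API may differ.
import inspect
from pydantic import BaseModel, ValidationError

class _StandInSimulator:
    def __init__(self):
        self.tools = {}

    def tool(self):
        def register(fn):
            # Capture the signature and docstring, as the article describes.
            self.tools[fn.__name__] = {
                "signature": str(inspect.signature(fn)),
                "description": inspect.getdoc(fn),
            }
            return fn
        return register

simulator = _StandInSimulator()

class CreateUserArgs(BaseModel):
    name: str
    age: int      # a string sent here fails validation before any LLM call

@simulator.tool()
def create_user(name: str, age: int) -> dict:
    """Create a user account and return its record."""
    ...           # description only — the simulator generates the response

# Schema enforcement: wrong type rejected at the boundary.
try:
    CreateUserArgs(name="Ana", age="not-a-number")
except ValidationError:
    print("invalid arguments rejected")
```

The function body stays empty: the description is the contract, and the simulator generates responses that satisfy it.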
Why is PII protection important?
PII (Personally Identifiable Information) is data that can identify a specific person — names, tax IDs, addresses, phone numbers, email addresses.
Testing agents on real APIs means PII leaks into logs, staging databases, and analytics. This is a regulatory problem (GDPR in the EU, HIPAA in the US) and a practical problem (leakage from staging to the public).
ToolSimulator never calls the real API, so there is no PII source — the simulation generates synthetic data that looks realistic but is not tied to real individuals.
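The idea of realistic-but-synthetic data can be sketched with the standard library alone. The names and the record shape below are invented for illustration; note the use of a reserved domain so the generated addresses can never reach a real mailbox.

```python
# Hedged sketch: synthetic records that look realistic but are not tied to
# real individuals. Names, fields, and values are invented for illustration.
import random
import uuid

FIRST_NAMES = ["Ana", "Marko", "Ivana", "Petar"]

def synthetic_user(rng: random.Random) -> dict:
    name = rng.choice(FIRST_NAMES)
    return {
        "id": str(uuid.uuid4()),                     # random, not a real identifier
        "name": name,
        "email": f"{name.lower()}@example.invalid",  # reserved TLD: never routable
    }

rng = random.Random(42)   # seeded so tests are reproducible
print(synthetic_user(rng))
```

Because the data is generated, there is nothing sensitive to leak into logs or staging databases in the first place.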
Who benefits from this?
Any team building agents with tool use. From startups testing MVP agents to large organizations validating production deployments. Especially useful for:
- Unit tests — isolating one agent interaction with one tool
- End-to-end tests — entire workflows with multiple tools and steps
- Regression tests — verifying that a new model behaves the same as the old one
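The regression case, for example, can be reduced to a simple pattern: run the same scripted scenario against both model versions and compare the tool calls they make. The agent class and prompts below are stand-ins invented for illustration; in practice each run would go through the simulator instead of live APIs.

```python
# Sketch of a regression gate under assumed names: same scenario, two agent
# versions, identical tool-call traces expected.
def run_scenario(agent) -> list[str]:
    """Drive one scripted conversation and record which tools were called."""
    agent.handle("create a user named Ana")
    agent.handle("show me all users")
    return agent.tool_calls

class ScriptedAgent:
    """Stand-in agent: maps prompts to tool calls deterministically."""
    ROUTES = {
        "create a user named Ana": "create_user",
        "show me all users": "list_users",
    }

    def __init__(self):
        self.tool_calls = []

    def handle(self, prompt: str):
        self.tool_calls.append(self.ROUTES[prompt])

old_trace = run_scenario(ScriptedAgent())   # baseline model
new_trace = run_scenario(ScriptedAgent())   # candidate model
assert new_trace == old_trace               # regression gate: same behavior
print("regression check passed")
```

Comparing traces rather than final answers catches behavioral drift (a new model calling different tools) even when the end result happens to look the same.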
Conclusion
ToolSimulator is a concrete response to a real problem: production agents need to be tested, and testing on live systems is expensive, slow, and risky. With this move, AWS signals that agent observability and testability are becoming first-class citizens in cloud infrastructure, not just an optional add-on. Integration with Strands Evals gives the platform a complete stack — from development through simulation to evaluation.
This article was generated using artificial intelligence from primary sources.