AWS ToolSimulator: LLM-Powered AI Agent Testing Without Live API Calls — Shared State Across Multi-Turn Conversations
Why it matters
AWS introduced ToolSimulator, an LLM-powered framework within the Strands Evals platform for safely testing AI agents without executing live API calls. The simulator maintains consistent shared state across multi-turn conversations and generates contextually appropriate responses, enabling testing of agents that send emails or modify databases without real consequences.
What did AWS announce?
On April 20, 2026, AWS introduced ToolSimulator — a new framework within the Strands Evals platform designed for safe and scalable testing of AI agents without executing live API calls. The goal is to solve one of the most painful problems in building production agents: how to test an agent that sends emails, modifies databases, or books flights without causing real consequences.
Why are classic mocks insufficient?
Developers have used mocks for decades — fake versions of external systems that return pre-defined responses. The problem with agents is that they conduct dynamic, multi-turn conversations in which system state evolves. A classic mock is static: it returns the same response every time. It cannot say “you created this user in step 1, now you can update them in step 5.”
The result: mocks are either too thin (missing realism) or too expensive to maintain (every test scenario requires hand-coding a state machine).
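The limitation described above can be seen in a few lines of plain Python. This is an illustrative sketch using the standard library's unittest.mock, not code from ToolSimulator: a static mock has no memory, so a user "created" earlier never appears in later responses.

```python
from unittest.mock import Mock

# A classic static mock: list_users always returns the same canned response,
# no matter what earlier calls in the conversation created.
api = Mock()
api.list_users.return_value = []   # pre-defined response, fixed forever

api.create_user(name="Ana")        # step 1: the "user" is created...
users = api.list_users()           # step 5: ...but the mock has no memory of it
print(users)                       # → [] — Ana is missing, so the test misleads
```

To make the mock track state, the developer would have to hand-code that behavior for every scenario, which is exactly the maintenance cost the article points to.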
How does ToolSimulator solve the problem?
ToolSimulator uses an LLM under the hood to generate tool responses on the fly. The key innovation is shared state — the simulator remembers everything that happened in the conversation and ensures future responses are consistent with history.
Example: the agent calls create_user(name="Ana") in step 2. In step 7 it calls list_users() — ToolSimulator knows Ana must be in the list because she was previously created. Without an LLM, the developer would have to manually code that state; with one, the simulator does it automatically.
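For contrast, here is roughly what "manually coding that state" looks like: a hand-written stateful fake that a developer would otherwise have to maintain for each scenario. The class and its fields are invented for illustration; ToolSimulator's point is that an LLM produces this consistency automatically.

```python
# Hand-coded stateful fake: the manual state machine ToolSimulator automates.
class HandCodedUserFake:
    def __init__(self):
        self._users = []                 # shared state, maintained by hand

    def create_user(self, name: str) -> dict:
        user = {"id": len(self._users) + 1, "name": name}
        self._users.append(user)         # step 2 mutates the state...
        return user

    def list_users(self) -> list[dict]:
        return list(self._users)         # ...so step 7 stays consistent with it

fake = HandCodedUserFake()
fake.create_user(name="Ana")             # step 2
print(fake.list_users())                 # step 7: [{'id': 1, 'name': 'Ana'}]
```

Every tool and every state transition needs code like this; with a dozen tools and branching conversations, the fake quickly becomes larger than the agent under test.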
What does the integration look like?
The technical integration is declarative and straightforward:
- @simulator.tool() decorator — the developer marks a Python function as a tool available to the agent. The simulator automatically captures the signature and docstring.
- Pydantic models — used for schema enforcement. What does this mean? Pydantic checks that arguments and return values have the correct types — if the agent sends a string instead of a number, the test fails at that point, before the LLM generates a response.
The developer writes a description of the tool, not the implementation — ToolSimulator covers the rest.
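The pattern might look roughly like the sketch below. The article shows only the decorator name, so the simulator object here is a minimal stand-in written for illustration, not the real ToolSimulator API; the Pydantic part uses the real pydantic library to show the type-rejection behavior described above.

```python
# Illustrative sketch only: "_StandInSimulator" mimics the described behavior
# (capturing signature and docstring); the real ToolSimulator API may differ.
import inspect
from pydantic import BaseModel, ValidationError

class _StandInSimulator:
    def __init__(self):
        self.tools = {}

    def tool(self):
        def register(fn):
            # Capture the signature and docstring, as the article describes.
            self.tools[fn.__name__] = {
                "signature": str(inspect.signature(fn)),
                "description": inspect.getdoc(fn),
            }
            return fn
        return register

simulator = _StandInSimulator()

class CreateUserArgs(BaseModel):
    name: str
    age: int      # a string sent here fails validation before any LLM call

@simulator.tool()
def create_user(name: str, age: int) -> dict:
    """Create a user account and return its record."""
    ...           # description only — the simulator generates the response

# Schema enforcement: wrong type rejected at the boundary.
try:
    CreateUserArgs(name="Ana", age="not-a-number")
except ValidationError:
    print("invalid arguments rejected")
```

The function body stays empty: the description is the contract, and the simulator generates responses that satisfy it.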
Why is PII protection important?
PII (Personally Identifiable Information) is data that can identify a specific person — names, tax IDs, addresses, phone numbers, email addresses.
Testing agents on real APIs means PII leaks into logs, staging databases, and analytics. This is a regulatory problem (GDPR in the EU, HIPAA in the US) and a practical problem (leakage from staging to the public).
ToolSimulator never calls the real API, so there is no PII source — the simulation generates synthetic data that looks realistic but is not tied to real individuals.
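The idea of realistic-but-synthetic data can be sketched with the standard library alone. The names and the record shape below are invented for illustration; note the use of a reserved domain so the generated addresses can never reach a real mailbox.

```python
# Hedged sketch: synthetic records that look realistic but are not tied to
# real individuals. Names, fields, and values are invented for illustration.
import random
import uuid

FIRST_NAMES = ["Ana", "Marko", "Ivana", "Petar"]

def synthetic_user(rng: random.Random) -> dict:
    name = rng.choice(FIRST_NAMES)
    return {
        "id": str(uuid.uuid4()),                     # random, not a real identifier
        "name": name,
        "email": f"{name.lower()}@example.invalid",  # reserved TLD: never routable
    }

rng = random.Random(42)   # seeded so tests are reproducible
print(synthetic_user(rng))
```

Because the data is generated, there is nothing sensitive to leak into logs or staging databases in the first place.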
Who benefits from this?
Any team building agents with tool use. From startups testing MVP agents to large organizations validating production deployments. Especially useful for:
- Unit tests — isolating one agent interaction with one tool
- End-to-end tests — entire workflows with multiple tools and steps
- Regression tests — verifying that a new model behaves the same as the old one
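The regression case, for example, can be reduced to a simple pattern: run the same scripted scenario against both model versions and compare the tool calls they make. The agent class and prompts below are stand-ins invented for illustration; in practice each run would go through the simulator instead of live APIs.

```python
# Sketch of a regression gate under assumed names: same scenario, two agent
# versions, identical tool-call traces expected.
def run_scenario(agent) -> list[str]:
    """Drive one scripted conversation and record which tools were called."""
    agent.handle("create a user named Ana")
    agent.handle("show me all users")
    return agent.tool_calls

class ScriptedAgent:
    """Stand-in agent: maps prompts to tool calls deterministically."""
    ROUTES = {
        "create a user named Ana": "create_user",
        "show me all users": "list_users",
    }

    def __init__(self):
        self.tool_calls = []

    def handle(self, prompt: str):
        self.tool_calls.append(self.ROUTES[prompt])

old_trace = run_scenario(ScriptedAgent())   # baseline model
new_trace = run_scenario(ScriptedAgent())   # candidate model
assert new_trace == old_trace               # regression gate: same behavior
print("regression check passed")
```

Comparing traces rather than final answers catches behavioral drift (a new model calling different tools) even when the end result happens to look the same.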
Conclusion
ToolSimulator is a concrete response to a real problem: production agents need to be tested, and testing on live systems is expensive, slow, and risky. With this move, AWS signals that agent observability and testability are becoming first-class citizens in cloud infrastructure, not just an optional add-on. Integration with Strands Evals gives the platform a complete stack — from development through simulation to evaluation.
This article was generated using artificial intelligence from primary sources.