AWS Bedrock AgentCore Optimization in preview: automated loop from production traces to A/B tests via OpenTelemetry
AWS presented AgentCore Optimization in preview on May 4, 2026 — an automated loop that derives concrete recommendations for system prompts and tool descriptions from production traces, runs batch evaluation against a test set, and performs A/B tests with statistical significance. The system collects OpenTelemetry-compatible traces of every model invocation, tool call, and reasoning step, replacing manual prompt guessing with a structured cycle grounded in production data.
This article was generated using artificial intelligence from primary sources.
On May 4, 2026, AWS introduced AgentCore Optimization on its Machine Learning blog as a new preview feature within Amazon Bedrock. The goal of the system is to improve production agents without the manual prompt editing and guesswork that have until now been the standard optimization approach.
What does the system concretely automate?
AgentCore Optimization delivers three key components:
- Recommendations — analyze production traces and evaluation results to suggest specific changes to the system prompt or tool descriptions. The engineer does not write the suggestion; the system generates it from real data.
- Batch evaluation — tests recommendations against a predefined test set to verify that changes are improvements in the broader case, not just for the example that triggered the recommendation.
- A/B testing — a controlled experiment between the old and new agent versions with statistical significance data, avoiding hasty “subjectively better” deployments.
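The statistical-significance step in the A/B comparison presumably reduces to a standard two-sample test on per-version success rates. A minimal stdlib sketch of that idea (a generic two-proportion z-test, not AgentCore's documented method; the pass counts below are invented):

```python
import math

def ab_significance(successes_a, n_a, successes_b, n_b):
    """Two-sided two-proportion z-test on success rates of versions A and B.

    Returns (z, p_value); a small p_value means the difference in success
    rate is unlikely to be noise. This is a generic test, not AgentCore's
    documented method.
    """
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # degenerate case: both rates 0% or 100%
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Hypothetical run: new prompt version B passes 820/1000 test cases vs. 760/1000 for A.
z, p = ab_significance(760, 1000, 820, 1000)
print(f"z={z:.2f}, p={p:.4f}")
```

With numbers like these the test rejects "no difference" at conventional thresholds, which is exactly the kind of evidence that distinguishes a measured rollout from a "subjectively better" deployment.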
How does the system collect production data?
End-to-end traceability in AgentCore records every model call, tool call, and reasoning step as an OpenTelemetry-compatible trace (OpenTelemetry is an open standard for distributed system observability). Teams already running OTEL in their stack can plug into their existing observability infrastructure without additional instrumentation.
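In OTEL's data model, each recorded step is a span carrying a shared trace ID and a parent link, which is what lets a downstream optimizer reconstruct the full call tree of an agent turn. A minimal stdlib sketch of that shape (not the OpenTelemetry SDK; span and attribute names are illustrative, not AgentCore's actual schema):

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    # One recorded step: a model call, tool call, or reasoning step.
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start_ns: int = field(default_factory=time.time_ns)

def record_turn(user_prompt: str) -> list:
    """Emit the spans one agent turn might produce (illustrative names)."""
    trace_id = uuid.uuid4().hex
    root = Span("agent.invoke", trace_id, attributes={"input": user_prompt})
    model = Span("model.call", trace_id, parent_id=root.span_id)
    tool = Span("tool.call", trace_id, parent_id=model.span_id,
                attributes={"tool.name": "market_data_lookup"})
    return [root, model, tool]
```

A collector exporting spans like these over OTLP would make them visible to any OTEL-compatible backend, which is why teams with an existing OTEL pipeline need no extra instrumentation.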
The result is that AgentCore Optimization works on real production examples, not on synthetic tests manually assembled by engineers. The system sees which prompts the agent receives in real-world conditions, where it fails, and how errors propagate through tool calls.
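Concretely, "seeing where the agent fails" amounts to mining spans for error status and grouping by the step that produced the error. A hypothetical sketch over plain-dict spans (field names are illustrative, not AgentCore's trace format):

```python
from collections import Counter

def failure_hotspots(spans):
    """Count error spans per step name, e.g. which tool fails most often.

    `spans` is a list of dicts with at least "name" and "status" keys;
    this schema is illustrative, not AgentCore's actual trace format.
    """
    return Counter(s["name"] for s in spans if s.get("status") == "error")

# Invented spans from a few agent turns.
spans = [
    {"name": "model.call", "status": "ok"},
    {"name": "tool.market_data", "status": "error"},
    {"name": "tool.market_data", "status": "error"},
    {"name": "tool.news_search", "status": "ok"},
]
print(failure_hotspots(spans).most_common(1))  # [('tool.market_data', 2)]
```

An aggregation like this is the plausible raw material for a recommendation such as "rewrite the description of the tool that fails most often".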
What does this change in operational agent management?
Most enterprise agents in 2026 stall in the phase between POC and full production. The reason is that teams have no systematic way to measure how prompt changes affect behavior. AgentCore Optimization addresses exactly that gap: the system becomes a feedback loop that learns from production data and suggests changes with measurable impact.
AWS uses the example of a Market Trends Agent for investment brokers in the blog post, but does not provide concrete benchmark numbers. This means the preview phase is focused on demonstrating the architecture, not on selling quantifiable results.
Pricing has not been publicly announced. The preview is available to Amazon Bedrock users in regions where AgentCore is already available.
Frequently Asked Questions
- What does AgentCore Optimization automate?
- Three things: (1) Recommendations that analyze production traces and evaluation results to propose changes to the system prompt or tool descriptions, (2) Batch evaluation against a predefined test set, (3) A/B testing between agent versions with statistical significance data.
- How are production traces collected?
- Through AgentCore's end-to-end traceability, which records every model call, tool call, and reasoning step as an OpenTelemetry-compatible trace. Development teams can reuse existing OTEL infrastructure without additional instrumentation.
- What is the key contribution compared to manual agent optimization?
- It replaces guesswork with structure: production data → recommendation → validation before deployment. The previous workflow required an engineer to read traces, manually change prompts, and hope the change worked — now the cycle is measurable.