🤝 Agents

54 articles

🟡 🤝 Agents April 27, 2026 · 3 min read

arXiv:2604.22748: Survey by 42 authors introduces 'levels × laws' taxonomy for world models in AI agents — synthesis of 400+ papers

Abstract compass quill tracing layers of world models across physical, digital, social, and scientific domains of agentic systems.

A survey by 42 authors titled 'Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond' organizes the field through a two-dimensional taxonomy — three levels of model capability (Predictor, Simulator, Evolver) and four domains of laws (physical, digital, social, scientific). The synthesis covers over 400 references and more than 100 representative systems.

🟡 🤝 Agents April 27, 2026 · 3 min read

arXiv:2604.22452: Superminds Test shows collective intelligence does not emerge spontaneously in a society of 2 million AI agents

Abstract compass quill tracing sparse and shallow connections between a multitude of AI agents in a large digital community.

Researchers from the University of Melbourne and the University of Maryland introduced the Superminds Test, a hierarchical framework for probing the collective intelligence of agent societies. A study on the MoltBook platform with over 2 million agents showed that the society does not outperform individual frontier models and that interactions remain very sparse and shallow.

🟢 🤝 Agents April 27, 2026 · 3 min read

arXiv:2604.21910: Agentic AI automates scientific workflow with 83% accuracy, 92% less data transfer and $0.001 per query

Bartosz Balis and colleagues at AGH University in Kraków published a paper on April 23, 2026 describing a system that converts natural-language research queries into executable scientific workflows. The three-layer architecture (semantic LLM layer, deterministic generator, expert Skills) was tested on the 1000 Genomes workflow on Kubernetes: Skills raised intent accuracy from 44% to 83% and reduced data transfer by 92%, at a cost below $0.001 per query.

🟡 🤝 Agents April 25, 2026 · 4 min read

arXiv:2604.21816: 'Tool Attention Is All You Need' Eliminates MCP Tax — 95% Token Reduction per Turn in Agentic Workflows

Editorial illustration: Tool Attention MCP Tax — agentic workflow optimization

Researchers Anuj Sadani and Deepak Kumar published a paper on ArXiv on April 23, 2026 addressing the so-called MCP Tax — eager schema injection that consumes 10 to 60 thousand tokens per turn. Their Tool Attention approach reduces consumption by 95% and raises context utilization from 24% to 91%.

🟢 🤝 Agents April 25, 2026 · 3 min read

AWS and Visier Demonstrate Enterprise Workforce AI Agents via Amazon Q and MCP Integration for HR Analytics

Editorial illustration: AWS Visier Amazon Q — workforce HR AI agents

AWS and Visier demonstrated workforce AI agent integration via Amazon Q and the Model Context Protocol. Visier exposes HR analytics as an MCP server, while Amazon Q agents use those tools for headcount budgeting, tenure tracking, and threshold alerts — all within a single conversational interface.

🟡 🤝 Agents April 24, 2026 · 3 min read

Anthropic: Memory for Managed Agents in public beta — AI agents that remember context between sessions

Editorial illustration: AI agents

Anthropic has released Memory for Claude Managed Agents into public beta. Agents can now retain user preferences, project conventions, and context between sessions. Beta limits include up to 1,000 stores per organization and 100 MB per store.

🟢 🤝 Agents April 24, 2026 · 2 min read

GitHub: Cloud agent sessions now available directly from issues and project views

GitHub has introduced the ability to track and manage cloud agent sessions directly from issues and project views. Session pills, side panels with progress logs, and automatically activated sessions in project views indicate deeper integration of autonomous AI agents into the development workflow.

🔴 🤝 Agents April 23, 2026 · 3 min read

Google DeepMind signs alliance with five leading consulting firms for enterprise AI

Google DeepMind has signed a partnership with five of the largest consulting firms — Accenture, Bain, BCG, Deloitte, and McKinsey — to accelerate enterprise AI transformation, given that currently only 25 percent of organizations manage to deploy AI to production.

🔴 🤝 Agents April 23, 2026 · 3 min read

OpenAI launches Workspace Agents in ChatGPT: Codex-powered agents for enterprise teams

OpenAI introduced Workspace Agents, Codex-powered AI agents integrated directly into the ChatGPT interface. The agents run in the cloud, automate complex workflows, and help enterprise teams scale work through connected tools with an emphasis on cross-application security.

🟡 🤝 Agents April 23, 2026 · 3 min read

AWS published architecture for company-wide AI agent memory using Bedrock, Neptune, and Mem0

AWS has published an architecture that combines Amazon Bedrock, the Neptune graph database, and the Mem0 framework for persistent AI agent memory at the company-wide level, solving the problem of context loss between sessions and users.

🟡 🤝 Agents April 23, 2026 · 2 min read

Amazon Bedrock AgentCore gets managed harness: a working agent in just three API calls

Amazon announced a managed agent harness for Bedrock AgentCore that enables deploying a fully working agent in just three API calls, without writing any orchestration infrastructure. The harness is accompanied by the AgentCore CLI covering the full development cycle and pre-built skills for coding assistants, available in preview across four AWS regions.

🟢 🤝 Agents April 23, 2026 · 3 min read

ArXiv SWE-chat — a dataset of real developer interactions with AI coding agents in production

SWE-chat, a dataset of real, in-the-wild interactions between users and AI coding agents in production environments, has been published on ArXiv. Rather than another synthetic benchmark based on GitHub issues, this dataset captures how developers actually use autonomous systems during their everyday work (what they ask for, how they respond to the agent's suggestions, and where the agent fails), opening the door to more precise evaluation and targeted improvements in agent design.

🟢 🤝 Agents April 23, 2026 · 2 min read

OSWorld study: AI computer-use agents often fail when repeating the same task

New research shows that AI agents for computer control that successfully complete a task once may fail on an identical repeated attempt. The study identifies three key reasons: execution stochasticity, task specification ambiguity, and agent behavior variability.

🔴 🤝 Agents April 22, 2026 · 4 min read

Google ReasoningBank: agents learn from experience without retraining, +8.3% success on WebArena

Editorial illustration: Robot in a maze with illuminated nodes representing learned experience

Google introduced ReasoningBank, a memory framework that enables AI agents to learn from their own successes and failures without retraining the language model. On the WebArena benchmark it achieved an 8.3% higher success rate, and on SWE-Bench-Verified a 4.6% higher success rate with approximately 3 fewer steps per task.

🔴 🤝 Agents April 22, 2026 · 4 min read

OpenAI scales Codex to enterprise: Codex Labs program and 4 million weekly active users

Editorial illustration: Futuristic cityscape with AI entity and corporate skyscrapers alongside code screens

OpenAI launched the Codex Labs program and strategic partnerships with Accenture, Deloitte, and KPMG to bring the Codex agent to large enterprises worldwide. The tool has reached 4 million weekly active users, offers certifications for consultants, and enterprise packages with a consumption-based billing model.

🟡 🤝 Agents April 22, 2026 · 2 min read

Agent-World: scalable environment synthesis for AI agent evolution from Renmin University

Editorial illustration: Dynamic environments with landscapes and cities automatically generated for AI agent training

Agent-World is a new research framework from China's Renmin University that automatically generates thousands of diverse environments for training AI agents. It replaces manually crafted benchmarks with dynamic scenarios and enables evolutionary learning through co-evolution of agent and environment.

🟡 🤝 Agents April 22, 2026 · 3 min read

Gemini Deep Research gets MCP integration, collaborative planning, and two new versions

Editorial illustration: Robot silhouette with modular servers and data flows for the Deep Research agent

Google launched two new Deep Research agent versions in the Gemini API — deep-research-preview-04-2026 and deep-research-max-preview-04-2026 — with MCP server integration, collaborative planning, visualizations, and streaming responses. The move positions Gemini as a serious competitor to ChatGPT Deep Research and Perplexity Deep Research.

🟡 🤝 Agents April 22, 2026 · 3 min read

Multi-Agent Systems survey: from classical paradigms to a large model-driven future

Editorial illustration: Connected AI agents in communication bridging classical paradigm with the modern LLM era

A new arXiv survey comprehensively bridges classical Multi-Agent Systems literature with the modern LLM-agent stack. The paper identifies a paradigm shift in coordination, communication protocols, and emergent behavior — from low-level state exchange to semantic reasoning.

🟡 🤝 Agents April 21, 2026 · 4 min read

AWS Combines Bedrock AgentCore, MCP and Nova 2 Sonic for Omnichannel Ordering — First Enterprise Agentic Showcase

AWS has published an architectural example combining Bedrock AgentCore Runtime, the MCP protocol and the Nova 2 Sonic voice model in an omnichannel ordering system. This is the first public integration of the new AWS agentic services and a demonstration of microVM isolation for production agents.

🟡 🤝 Agents April 21, 2026 · 3 min read

LLM Agents Can Form a Stable Price Cartel Through Prompt Optimization, New Study Warns

A new ArXiv paper shows that multiple LLM agents can spontaneously develop stable algorithmic collusion through meta-prompt optimization, achieving supra-competitive prices without any explicit agreement. The findings raise serious questions for antitrust law and the regulation of multi-agent systems.

🟡 🤝 Agents April 21, 2026 · 4 min read

NVIDIA OpenShell, Adobe Agents and WPP: Autonomous AI Agents Create Marketing Content in Minutes

Editorial illustration: NVIDIA OpenShell, Adobe Agents and WPP: autonomous AI agents create marketing content in minutes

NVIDIA expanded its strategic partnerships with Adobe and global marketing agency WPP to launch autonomous AI agents in enterprise marketing. The foundation is the new NVIDIA OpenShell — a secure runtime environment with policy-based isolation — combined with Nemotron models and the Adobe Firefly Foundry visual content generator.

🟢 🤝 Agents April 21, 2026 · 3 min read

AWS ToolSimulator: LLM-Powered AI Agent Testing Without Live API Calls — Shared State Across Multi-Turn Conversations

Editorial illustration: AWS ToolSimulator: LLM-powered AI agent testing without live API calls, shared state across…

AWS introduced ToolSimulator, an LLM-powered framework within the Strands Evals platform for safely testing AI agents without executing live API calls. The simulator maintains consistent shared state across multi-turn conversations and generates contextually appropriate responses, enabling testing of agents that send emails or modify databases without real consequences.

🟢 🤝 Agents April 21, 2026 · 3 min read

NVIDIA Releases Nemotron-Personas-Korea: 7 Million Synthetic Personas for Korean AI Agents

NVIDIA and partners have released the open-source dataset Nemotron-Personas-Korea with 7 million synthetic personas grounded in official Korean demographic data. The goal is to enable development of culturally aware AI agents without privacy risks.

🟡 🤝 Agents April 20, 2026 · 3 min read

Experience Compression Spectrum: an architectural framework unifying memory, skills, and rules in LLM agents

Editorial illustration: a continuum of experience compression levels from raw episodes to distilled rules in an LLM agent

The Experience Compression Spectrum is a new architectural framework that positions memory, skills, and rules of LLM agents along a single axis of increasing compression — from episodic memory (5–20×) through procedural skills (50–500×) to declarative rules (1000×+). The analysis reveals that existing systems operate at fixed compression levels and that memory and skills do not communicate with each other.
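The three bands quoted above can be written down as a small lookup. This is only an illustrative sketch: the band boundaries are the figures from the summary, while the function names and the definition of ratio as raw size divided by stored size are assumptions made for the example, not taken from the paper.

```python
from typing import Optional

# Compression bands as quoted in the summary (ratio = raw size / stored size).
# An upper bound of None means the band is open-ended.
BANDS = [
    ("episodic memory", 5, 20),
    ("procedural skill", 50, 500),
    ("declarative rule", 1000, None),
]


def compression_ratio(raw_tokens: int, stored_tokens: int) -> float:
    """Ratio of the raw experience size to the size of what is kept."""
    if stored_tokens <= 0:
        raise ValueError("stored_tokens must be positive")
    return raw_tokens / stored_tokens


def classify(ratio: float) -> Optional[str]:
    """Return the band containing `ratio`, or None if it falls between bands."""
    for name, low, high in BANDS:
        if ratio >= low and (high is None or ratio <= high):
            return name
    return None
```

A 10,000-token trace stored as a 1,000-token note has a ratio of 10 and lands in the episodic band; a ratio of 30 falls in the gap between the quoted ranges, so `classify` returns `None`.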

🟡 🤝 Agents April 20, 2026 · 3 min read

WORC: strengthening the weakest agents in multi-agent systems achieves 82.2% accuracy on reasoning benchmarks

Editorial illustration: a chain of AI agents where the weakest link is strengthened with additional compute resources

WORC (Weak-Link Optimization for Reasoning and Collaboration) is a new framework that, instead of optimizing the strongest agents, identifies and strengthens weak links in multi-agent LLM systems. Using meta-learning and swarm intelligence, it finds underperformers and allocates additional reasoning resources to them. The result: 82.2% average accuracy on reasoning benchmarks and improved stability across architectures.

🟡 🤝 Agents April 19, 2026 · 3 min read

Autogenesis: New Protocol for Self-Modifying AI Agents with Versioned Resources and Rollback Mechanism

Editorial illustration: modular system of components with feedback loops and versioned flows

Autogenesis (AGP) is a protocol that models AI agents, prompts, tools, and memory as registered resources with explicit state and versioned interfaces. The Self Evolution Protocol Layer (SEPL) provides a closed-loop operator interface for proposing, evaluating, and committing improvements with an audit trail and rollback, solving the instability problem of agents that iteratively modify their own components.
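The propose, evaluate, commit, and rollback loop described above can be pictured with a toy registry. This is a minimal sketch under stated assumptions: `Registry`, `Resource`, `propose_and_commit`, and the scoring callback are hypothetical names invented for this example and are not the AGP or SEPL interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Resource:
    """A registered resource (prompt, tool config, memory) with its version history."""
    name: str
    versions: List[str] = field(default_factory=list)

    @property
    def current(self) -> str:
        return self.versions[-1]


class Registry:
    """Hypothetical registry: every change is versioned and written to an audit log."""

    def __init__(self) -> None:
        self.resources: Dict[str, Resource] = {}
        self.audit_log: List[str] = []

    def register(self, name: str, content: str) -> None:
        self.resources[name] = Resource(name, [content])
        self.audit_log.append(f"register {name} v1")

    def propose_and_commit(self, name: str, candidate: str,
                           score: Callable[[str], float]) -> bool:
        """Commit the candidate only if it scores strictly higher than the current version."""
        res = self.resources[name]
        if score(candidate) > score(res.current):
            res.versions.append(candidate)
            self.audit_log.append(f"commit {name} v{len(res.versions)}")
            return True
        self.audit_log.append(f"reject {name}")
        return False

    def rollback(self, name: str) -> None:
        """Drop the latest version and return to the previous one."""
        res = self.resources[name]
        if len(res.versions) < 2:
            raise ValueError("nothing to roll back")
        res.versions.pop()
        self.audit_log.append(f"rollback {name} to v{len(res.versions)}")
```

Keeping the full version list rather than overwriting in place is what makes rollback cheap here; the audit log records rejected proposals as well as committed ones.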

🟡 🤝 Agents April 19, 2026 · 2 min read

RadAgent: AI Tool That Interprets Chest CT Scans Step by Step with +36% Relative F1 Improvement

Editorial illustration: AI agent analyzing a chest CT scan, medical context without faces

RadAgent is an AI agent for chest CT scan interpretation that works through a transparent step-by-step process and improves on the baseline CT-Chat model by 36.4% in relative macro-F1, 19.6% in micro-F1, and 41.9% in adversarial robustness. The tool generates radiology reports with inspectable decision traces and achieves 37% Faithfulness compared to 0% for the baseline.

🟢 🤝 Agents April 19, 2026 · 3 min read

CoopEval: stronger reasoning models are systematically less cooperative in social dilemmas — a counterintuitive finding for multi-agent AI

Editorial illustration: two abstract agents in a social dilemma, elements of game theory

CoopEval is a new benchmark that tests LLM agents in classic social dilemmas such as Prisoner's Dilemma and Public Goods games. A counterintuitive finding: stronger reasoning models defect more often than weaker ones, systematically undermining cooperation in single-shot mixed-motive situations. The finding has important implications for multi-agent AI deployments in which an agent must balance its own interests with collective outcomes.
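For readers unfamiliar with the setup, the tension in a single-shot Prisoner's Dilemma can be shown in a few lines. The payoff values below are conventional textbook numbers, not figures from CoopEval, and the helper names are made up for this sketch.

```python
# Payoffs as (row player, column player) for a standard Prisoner's Dilemma.
# The values 3/0/5/1 are textbook conventions, assumed here for illustration.
C, D = "cooperate", "defect"
PAYOFF = {
    (C, C): (3, 3),
    (C, D): (0, 5),
    (D, C): (5, 0),
    (D, D): (1, 1),
}


def best_response(opponent_move: str) -> str:
    """Move that maximizes the row player's own payoff against a fixed opponent move."""
    return max((C, D), key=lambda my_move: PAYOFF[(my_move, opponent_move)][0])


def total_welfare(a: str, b: str) -> int:
    """Sum of both players' payoffs for a pair of moves."""
    return sum(PAYOFF[(a, b)])
```

With these payoffs, defecting is the best response to either opponent move, yet mutual cooperation yields a higher combined payoff than mutual defection. That gap between individual and collective outcomes is the kind of mixed-motive situation the benchmark probes.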

🟢 🤝 Agents April 19, 2026 · 3 min read

Mind DeepResearch: a three-agent framework achieves top results on deep research tasks using 30B models instead of GPT-4-scale

Editorial illustration: three abstract agents collaborating in a research process, network structure

Mind DeepResearch (MindDR) is a new multi-agent framework for deep research that achieves competitive results with models of around 30 billion parameters, in the size class of Qwen2.5 or DeepSeek models rather than GPT-4 or Claude Opus. According to a technical report published April 17, 2026, the architecture combines a Planning Agent, a DeepSearch Agent, and a Report Agent, trained with a four-stage pipeline that includes data synthesis.

🟡 🤝 Agents April 18, 2026 · 3 min read

LangChain and Cisco demonstrate agentic engineering: 93% faster bug detection and 65% faster development

Editorial illustration: a coordinated swarm of AI agents in software development, abstract network visualization

Agentic engineering is an approach in which swarms of AI agents take over the entire software development lifecycle, not just code writing. On April 17, 2026, LangChain and Cisco engineers Renuka Kumar and Prashanth Ramagopal published a reference architecture with Leader and Worker agents. In Cisco's pilot with 70 users and 512 sessions, it reduced bug root-cause detection time by 93% and development workflow execution time by 65%.

🟢 🤝 Agents April 18, 2026 · 2 min read

HuggingFace releases Ecom-RLVE-Gym: 8 environments and a 12-axis curriculum for training e-commerce agents with reinforcement learning

Editorial illustration: abstract e-commerce training environment with a network of products and learning paths

On April 16, 2026, the Owlgebra AI team published Ecom-RLVE-Gym on the HuggingFace blog: an open framework with 8 verifiable environments for e-commerce conversational agents that uses algorithmic rewards instead of an LLM judge. The system uses a catalog of 2 million products, the Qwen 3 8B model, and a 12-axis adaptive curriculum that incrementally increases task difficulty for the agent. It is positioned as a response to the limitations of supervised fine-tuning in complex multi-step workflows.

🔴 🤝 Agents April 17, 2026 · 2 min read

OpenAI: Codex for (almost) everything — desktop app with computer use, browsing and plugins

OpenAI Codex is an updated desktop application for macOS and Windows that now integrates computer use, in-app browsing, image generation, persistent memory and a plugin system. Launched on the same day as Anthropic's Opus 4.7, Codex represents the most ambitious attempt to create an all-in-one AI coding assistant with full agentic capabilities.

🟡 🤝 Agents April 17, 2026 · 2 min read

GitHub CLI: new gh skill command enables management of AI agent skills across all platforms

GitHub CLI version 2.90.0 introduces the gh skill command that enables discovery, installation, management and publishing of AI agent skills for GitHub Copilot, Claude Code, Cursor, Codex, Gemini CLI and Antigravity. Supply chain security is ensured through immutable releases, SHA content verification and version pinning.

🟢 🤝 Agents April 17, 2026 · 2 min read

ArXiv OpenMobile: open-source mobile agents with trajectory synthesis and policy-switching

OpenMobile is a new open-source framework for developing mobile agents based on vision-language models. After fine-tuning Qwen2.5-VL, it achieves 51.7% success, and Qwen3-VL reaches 64.7% on the AndroidWorld benchmark — significantly above existing open-data approaches and close to closed systems that reach nearly 70%. The authors release all data and code publicly.

🟢 🤝 Agents April 17, 2026 · 2 min read

LangChain: async subagents bring fire-and-steer paradigm for hundreds of parallel AI agents

LangChain has released a new async subagent model that allows a supervisor agent to launch hundreds of parallel subagent instances without blocking. The fire-and-steer paradigm allows changing instructions to subagents mid-execution through the start_async_task, check_async_task and update_async_task tools, running on the LangSmith platform or self-hosted infrastructure.
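The fire-and-steer idea (launch without blocking, check later, change instructions mid-run) can be imitated with plain `asyncio`. The sketch below is not LangChain's API: the method names are borrowed from the tool names in the summary purely for readability, and the `Supervisor` class, its arguments, and the fake subagent loop are assumptions made for this example.

```python
import asyncio
from typing import Dict, List


class Supervisor:
    """Toy supervisor built on asyncio; NOT the LangChain implementation."""

    def __init__(self) -> None:
        self._tasks: Dict[str, asyncio.Task] = {}
        self._instructions: Dict[str, str] = {}

    async def _subagent(self, task_id: str, steps: int) -> str:
        latest = self._instructions[task_id]
        for _ in range(steps):
            await asyncio.sleep(0)                # yield so other subagents can run
            latest = self._instructions[task_id]  # pick up any steering update
        return f"done: {latest}"

    def start_async_task(self, task_id: str, instruction: str, steps: int = 3) -> None:
        """Fire: schedule the subagent and return immediately."""
        self._instructions[task_id] = instruction
        self._tasks[task_id] = asyncio.create_task(self._subagent(task_id, steps))

    def check_async_task(self, task_id: str) -> str:
        """Report the result if finished, otherwise 'running'."""
        task = self._tasks[task_id]
        return task.result() if task.done() else "running"

    def update_async_task(self, task_id: str, instruction: str) -> None:
        """Steer: replace the instruction a running subagent will read next."""
        self._instructions[task_id] = instruction

    async def wait_all(self) -> None:
        await asyncio.gather(*self._tasks.values())


async def demo() -> List[str]:
    sup = Supervisor()
    for i in range(100):  # launch many subagents without blocking
        sup.start_async_task(f"t{i}", f"research topic {i}")
    first_check = sup.check_async_task("t0")
    sup.update_async_task("t0", "research topic 0, focus on benchmarks")
    await sup.wait_all()
    return [first_check, sup.check_async_task("t0")]
```

Running `asyncio.run(demo())` shows the first check reporting `running` and the final result reflecting the updated instruction, since each subagent re-reads its instruction after every step.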

🟡 🤝 Agents April 16, 2026 · 2 min read

OpenAI: Next-Generation Agents SDK Introduces Native Sandbox Execution for Reliable Agents

OpenAI has announced a significant upgrade to its Agents SDK, introducing native sandbox execution and a model-native harness for building more reliable long-running AI agents. The new release focuses on code execution safety and agent autonomy, enabling development teams to build agents that can operate for hours without human supervision while maintaining reliability.

🟢 🤝 Agents April 16, 2026 · 2 min read

ArXiv: TREX — Two AI Agents Automate the Entire LLM Fine-Tuning Process

TREX is a new multi-agent system that automates the complete fine-tuning pipeline for large language models — from requirements analysis and literature search to data preparation and results evaluation. The system models the experimental process as a search tree, and on the FT-Bench benchmark with 10 real-world tasks, it consistently optimizes model performance.

🟢 🤝 Agents April 16, 2026 · 2 min read

IBM Research: VAKRA Benchmark Reveals AI Agents Fail on Complex Reasoning

IBM Research has published VAKRA — a new benchmark for evaluating AI agents in enterprise environments, comprising more than 8,000 local APIs, 62 domains, and 4,187 test instances. The key finding is that models display surface-level competence on simple tasks but fail on compositional reasoning, multi-hop reasoning degrades with depth, and adherence to external constraints causes a significant performance drop.

🔴 🤝 Agents April 15, 2026 · 2 min read

ArXiv: Bans Work, Instructions Backfire — Empirical Study of Rules for AI Coding Agents

An analysis of 679 rule files and 25,532 rules from GitHub shows that prohibitions improve AI coding agents, but positive instructions actually hurt them. Random rules perform just as well as expertly written ones.

🟡 🤝 Agents April 15, 2026 · 1 min read

ArXiv: HORIZON — Where and Why AI Agents Fail on Long-Horizon Tasks

The new HORIZON benchmark systematically analyzes how LLM agents fail on long-horizon tasks. The research reveals that errors accumulate across multiple steps, and even the best models lose focus after 20+ actions.

🟡 🤝 Agents April 15, 2026 · 2 min read

ArXiv: PAC-BENCH — What Happens When AI Agents Must Keep Secrets While Collaborating?

PAC-BENCH is presented as the first benchmark for evaluating collaboration among multiple AI agents under privacy constraints. Results show that privacy constraints significantly degrade collaboration quality and cause three types of errors, including privacy-induced hallucinations.

🟢 🤝 Agents April 15, 2026 · 2 min read

ArXiv: SWE-AGILE — How Small Models Solve the Context Explosion in Coding Agents

SWE-AGILE introduces a dynamic context strategy with sliding windows and compressed summaries for AI coding agents. With a model of only 7-8B parameters, it achieves a new state-of-the-art on SWE-Bench-Verified, using only 2,200 training examples.
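The general idea of a sliding window combined with compressed summaries can be sketched as follows. This is an assumption-laden illustration, not the SWE-AGILE method: the function names are hypothetical, and the truncating `truncate_summary` is a stand-in for whatever model-written compression the paper actually uses.

```python
from typing import Callable, List


def truncate_summary(text: str, limit: int = 40) -> str:
    """Placeholder compressor; a real system would use a model-written summary."""
    return text if len(text) <= limit else text[:limit] + "..."


def build_context(turns: List[str], window: int = 2,
                  summarize: Callable[[str], str] = truncate_summary) -> List[str]:
    """Keep the last `window` turns verbatim and replace older turns with summaries."""
    if window < 0:
        raise ValueError("window must be non-negative")
    cutoff = max(len(turns) - window, 0)
    older = [f"[summary] {summarize(t)}" for t in turns[:cutoff]]
    recent = turns[cutoff:]
    return older + recent
```

The context keeps one entry per turn, but only the most recent turns carry their full text, so total length grows much more slowly than the raw history.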

🔴 🤝 Agents April 14, 2026 · 1 min read

OpenAI and Cloudflare: GPT-5.4 and Codex power new Agent Cloud platform for enterprise

Cloudflare has integrated OpenAI's GPT-5.4 and Codex models into its new Agent Cloud platform, enabling enterprise users to build, deploy, and scale AI agents for real-world business tasks with an emphasis on speed and security.

🟡 🤝 Agents April 14, 2026 · 2 min read

AI2: AI agents solve 80% of school-level science but only 20% of real scientific problems

The Allen Institute for AI analyzes two benchmarks that reveal a dramatic gap between AI performance on knowledge tests and the ability to make real scientific discoveries. While models reach 80% at the school level, they drop to 20% on complex scientific tasks.

🟡 🤝 Agents April 14, 2026 · 2 min read

ArXiv HiL-Bench: Do AI agents know when to ask a human for help?

The new HiL-Bench benchmark measures the ability of AI agents to recognize their own limitations and ask for human help instead of guessing. Results show that even frontier models poorly judge when they need help, but targeted training can improve this ability.

🔴 🤝 Agents April 13, 2026 · 2 min read

ArXiv HiL-Bench: no frontier model knows when to ask for help

A new benchmark reveals a universal judgment deficiency in AI agents — when specifications are incomplete, no frontier model achieves more than a fraction of its full performance. Researchers show this skill can be trained with RL.

🟢 🤝 Agents April 13, 2026 · 2 min read

ArXiv SAGE: 27 LLMs tested — models understand intent but don't execute correctly

A new benchmark for customer service reveals two phenomena: 'Execution Gap' (models correctly classify intents but don't perform the correct actions) and 'Empathy Resilience' (models remain polite while making logical errors).

🟡 🤝 Agents April 12, 2026 · 2 min read

GitHub Copilot CLI: Official Beginner's Guide — Delegating Tasks to Cloud Agents from the Terminal

On April 10, GitHub published an official tutorial for the Copilot CLI tool. The guide covers installation via npm, authentication with a GitHub account, and practical examples — including delegating tasks to cloud agents.

🟡 🤝 Agents April 11, 2026 · 2 min read

Anthropic publishes 'Trustworthy agents in practice' policy framework

Anthropic has published a comprehensive policy framework 'Trustworthy agents in practice' that defines what it means to develop, deploy, and use AI agents in a reliable manner. The document serves as a guide for companies building or using agents.

🟡 🤝 Agents April 11, 2026 · 2 min read

ArXiv PASK: proactive AI agents with long-term memory that predict user intent

A new paper, PASK, introduces a framework for proactive AI agents that combine intent detection, hybrid memory, and self-initiated action. The IntentFlow model reached the level of the leading Gemini 3 Flash models in recognizing latent user needs.

🟡 🤝 Agents April 11, 2026 · 2 min read

ArXiv SAVeR: self-auditing for LLM agents — verify before you execute (ACL 2026)

A new method, SAVeR (Self-Audited Verified Reasoning), accepted at ACL 2026, enables LLM agents to audit themselves before executing actions. The goal: to prevent coherent reasoning that violates logical constraints from leading to incorrect decisions.

🟢 🤝 Agents April 11, 2026 · 2 min read

ArXiv KnowU-Bench: new benchmark for interactive and proactive mobile AI agents

Researchers have introduced KnowU-Bench — a comprehensive benchmark for evaluating a new generation of mobile AI agents, focusing on interactivity, proactivity, and personalization through long-term use.

🟡 🤝 Agents April 10, 2026 · 2 min read

AWS Agent Registry: enterprise catalog of AI agents now in preview

Amazon has released a preview of AWS Agent Registry, a centralized catalog of AI agents, tools and agent skills for enterprise organizations. The system indexes agents regardless of where they are hosted (AWS, other clouds, on-premises) and uses a combination of keyword and semantic search along with IAM-based access control.

🟡 🤝 Agents April 10, 2026 · 2 min read

AWS Bedrock AgentCore: stateful MCP client enables interactive AI workflows

Amazon has extended Bedrock AgentCore Runtime with three new MCP capabilities — elicitation (requesting structured input from the user), sampling (requesting LLM completions from the client), and progress notifications. Stateful sessions can now last up to 8 hours in isolated microVMs and enable two-way communication between agent and client.