🤝 Agents

145 articles

🔴 🤝 Agents May 23, 2026 · 4 min read

arXiv:2605.22502: Compiling agentic workflows into LLM weights achieves near-frontier quality at 100× lower cost

Editorial illustration: workflow nodes collapsing into a compact neural network core

Researchers demonstrated that complex agentic workflows can be compiled directly into the weights of a smaller fine-tuned model instead of relying on external orchestration frameworks such as LangChain or LangGraph. The approach reaches near-frontier quality at 100× lower inference cost across three real-world scenarios (travel booking, Zoom support, and insurance), with workflows of 14 to 55 nodes.

🔴 🤝 Agents May 23, 2026 · 3 min read

arXiv:2605.22794: MOSS shows agents that self-improve by rewriting their own source code

Editorial illustration: AI agent rewriting its own source code in a sandbox loop

Researchers presented MOSS, a framework for autonomous agents that improve themselves by rewriting their own source code — not just their prompt or fine-tuning weights. On the OpenClaw benchmark, a single MOSS self-evolution cycle raises the score from 0.25 to 0.61 without any human intervention, showing that agents can fix routing, hooks, and dispatch logic that text-only methods cannot touch.

🟡 🤝 Agents May 23, 2026 · 3 min read

arXiv:2605.22535: TerminalWorld benchmark measures LLM agents on real Linux terminal tasks without simulation

Editorial illustration: terminal prompt with git and bash commands and an AI agent executing them

TerminalWorld is a new benchmark that evaluates LLM agents on real bash, git, and file operations in genuine Linux processes — no simulation. The eight-author paper led by Zhaoyang Chu and Jiarui Hu sets a new bar for 'computer use' agents and is directly relevant to tools like Claude Code, GitHub Copilot Workspace, and Cursor's agentic mode.

🟡 🤝 Agents May 23, 2026 · 3 min read

Anthropic Claude Code v2.1.149 brings per-category breakdown in /usage and closes PowerShell permission bypass

Editorial illustration: terminal with usage breakdown chart and a security shield

Anthropic released Claude Code CLI v2.1.149, which extends the /usage command with a cost breakdown by category (skills, subagents, plugins, per-MCP server). The release closes two security vulnerabilities: a PowerShell permission bypass through built-in functions and an incorrect allowlist for the git worktree sandbox. An enterprise setting allowAllClaudeAiMcps was also added for cloud MCP connectors.

🔴 🤝 Agents May 22, 2026 · 3 min read

Microsoft Research: MagenticLite + Fara1.5 (4B/9B/27B) — agentic AI optimised for small models achieves SOTA

Microsoft Research released a trio for agentic AI with small models on 21 May 2026: MagenticLite (a browser and filesystem UI application), MagenticBrain (a 14B orchestration model fine-tuned from Qwen 3 14B), and Fara1.5 (a computer-use model in 4B, 9B, and 27B variants). Fara1.5-27B reaches over 90% of SOTA on the Online-Mind2Web benchmark (300 web tasks), nearly doubling the performance of the previous Fara-7B. The goal is to demonstrate that agentic AI does not require massive models — only well co-designed tools and a harness.

🟡 🤝 Agents May 22, 2026 · 3 min read

AWS: Nova Act receives HIPAA eligibility — agentic ePHI automation for healthcare workflows

AWS announced on 21 May 2026 that Amazon Nova Act, the agentic AI service for automating browser and UI workflows, has received formal HIPAA-eligible status. Healthcare organisations can now use Nova Act on workflows that involve electronic protected health information (ePHI), such as handling prior authorisations, verifying insurance, and submitting referrals through vendor web portals. The service integrates with Amazon Bedrock AgentCore and the Strands Agents framework, requires a signed BAA and AWS KMS encryption, and currently operates only in the US East (N. Virginia) region.

🟡 🤝 Agents May 22, 2026 · 3 min read

Anthropic: Claude Code v2.1.147 introduces Workflow tool for deterministic multi-agent orchestration

Anthropic released Claude Code v2.1.147 on 21 May 2026 at 20:39 UTC — a new CLI version introducing the Workflow tool, the first deterministic multi-agent orchestration mechanism in the Claude Code ecosystem. The tool is initially disabled by default and activated via the CLAUDE_CODE_WORKFLOWS=1 environment variable. The same version renames the existing /simplify command to /code-review with effort levels (high/medium/low) and adds sandbox hardening against prototype-pollution and thenable-based escape attacks.

🟡 🤝 Agents May 22, 2026 · 4 min read

LangChain: From token streams to agent streams — typed channels replace classic streaming for multi-agent UI

LangChain published a post on 21 May 2026 by Christian Bromann and Nick Hollon describing a shift from token streams to structured agent streams. Modern AI agents plan tasks, delegate to sub-agents, call tools, and pause for human review, and classic token-by-token text streaming is insufficient for displaying that work. LangChain proposes typed channels that carry messages, tool calls, state changes, sub-agent activity, and custom events. Applications subscribe only to the event types they need, which keeps the UI efficient for long-running workloads.
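
To make the typed-channel idea concrete, here is a minimal sketch of a stream where each event carries a channel name and consumers register only for the channels they care about. It is not LangChain's API: `AgentEvent`, `AgentStream`, and the channel names are hypothetical, chosen only to mirror the event categories listed in the summary.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class AgentEvent:
    """One item on the stream. Channel names here are illustrative assumptions."""
    channel: str  # e.g. "messages", "tool_calls", "state", "subagents", "custom"
    payload: Any


class AgentStream:
    """Hypothetical in-memory event bus, not a real LangChain class."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[AgentEvent], None]) -> None:
        # Register a handler for exactly one channel.
        self._subscribers[channel].append(handler)

    def publish(self, event: AgentEvent) -> None:
        # Only handlers registered for this event's channel are invoked,
        # so a UI that ignores "state" never pays for those events.
        for handler in self._subscribers.get(event.channel, []):
            handler(event)


stream = AgentStream()
tool_calls_seen = []
stream.subscribe("tool_calls", lambda event: tool_calls_seen.append(event.payload))

stream.publish(AgentEvent("messages", "Planning the task"))
stream.publish(AgentEvent("tool_calls", {"name": "search", "args": {"q": "flights"}}))
stream.publish(AgentEvent("state", {"step": 2}))
```

In this sketch the subscriber receives only the single tool-call event; the message and state events are dropped for it because nothing registered for those channels.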

🟡 🤝 Agents May 22, 2026 · 3 min read

OpenAI: Codex scaling to enterprise — 4 million weekly active users and the Codex Labs program

On 21 May 2026, OpenAI announced an enterprise scaling push for Codex, its agentic coding tool, which has reached 4 million weekly active users. It also announced the new Codex Labs program and partnerships with major consulting firms to help large enterprises implement and scale Codex. The news marks a formal enterprise go-to-market move that positions Codex as a direct competitor to GitHub Copilot in the mid-market and high-end segments.

🟡 🤝 Agents May 21, 2026 · 2 min read

Anthropic: MCP Tunnels and self-hosted sandboxes for Claude Managed Agents

Editorial illustration: Anthropic MCP Tunnels for private networks and self-hosted sandboxes for Claude Managed Agents

Anthropic presented MCP Tunnels in Research Preview on 19 May 2026 — a feature enabling Claude agents to connect to Model Context Protocol servers on a user's private network — and self-hosted sandboxes as an alternative to Anthropic's own infrastructure for tool execution. Updates also include dynamic MCP configuration changes within active sessions and automatic overflow of outputs larger than 100K tokens into a sandbox file.

🟡 🤝 Agents May 21, 2026 · 2 min read

Google DeepMind: Co-Scientist multi-agent AI partner for scientific research

Editorial illustration: Google DeepMind Co-Scientist multi-agent AI partner for accelerating scientific research

Google DeepMind announced Co-Scientist on 19 May 2026 — a Gemini-based multi-agent AI system that generates, debates, and refines scientific hypotheses using 6 specialised agents in a Tournament of Ideas debate. The system was developed in collaboration with more than 100 research institutions and has already produced concrete results in liver fibrosis, ALS, cellular ageing, and infectious disease research, with analysis time reduced from months to days.

🟡 🤝 Agents May 21, 2026 · 2 min read

Google: I/O 2026 round 2 — Antigravity 2.0, Gemini Spark and Universal Cart

Editorial illustration: Google I/O 2026 second wave — Antigravity 2.0, Gemini Spark and Universal Cart consolidate the agent-first strategy

At I/O 2026, Google announced a second wave of major AI launches: Antigravity 2.0, an agent-first development platform with CLI and SDK; Gemini Spark, a persistent personal AI agent running in the background on-device; and Universal Cart, an AI shopping assistant integrated across Google services. The trio follows the earlier Gemini 3.5 Flash and Omni announcements and consolidates Google's agent-first ecosystem strategy.

🟡 🤝 Agents May 21, 2026 · 3 min read

LangChain: Deep Agents get QuickJS interpreters for code between tool calls

Editorial illustration: LangChain Deep Agents with QuickJS interpreters that preserve state between tool calls and reduce token consumption

LangChain introduced interpreters on 20 May 2026 — embedded QuickJS runtime environments in the Deep Agents framework that let agents write and execute code between LLM tool calls without serialising state into the message history. The company claims up to 35 percent lower token consumption on some tasks because state persists within the runtime instead of in the model context, with an explicitly controlled action space that by default has no access to the filesystem, network, or shell.

🟡 🤝 Agents May 20, 2026 · 2 min read

Anthropic Claude Code: Live session scripting and security fixes in v2.1.145

Anthropic Claude Code v2.1.145 brings JSON output of live sessions for scripting, extended OTEL trace attributes for agent tracking, and fixes for a security vulnerability in bash command approval.

🟡 🤝 Agents May 20, 2026 · 2 min read

Anthropic: Claude for 276,000 KPMG employees in 138 countries

Anthropic and KPMG have entered into a strategic global alliance that gives all employees of KPMG, one of the four largest audit firms in the world, access to Claude. Claude is being embedded in KPMG's Digital Gateway, and KPMG becomes Anthropic's preferred partner for the private equity sector.

🟡 🤝 Agents May 20, 2026 · 2 min read

AWS: Three architectural patterns for scalable voice agents with Amazon Nova Sonic

AWS published a detailed guide for scalable voice agents using Amazon Nova Sonic and AgentCore Gateway. Three clear patterns — direct tools, sub-agents, and session segmentation — offer different tradeoffs between latency and complexity.

🟡 🤝 Agents May 20, 2026 · 2 min read

GitHub Copilot Gets Gemini 3.5 Flash: Speed and Quality for Everyday Coding

Google's Gemini 3.5 Flash model is becoming generally available for all GitHub Copilot plans. It promises near-Pro-tier quality combined with Flash-tier speed and lower cost, with emphasis on agentic workflows and multiple IDE environments.

🟢 🤝 Agents May 20, 2026 · 2 min read

arXiv:2605.18703: EnvFactory – RL training of tool-use agents with 5× fewer environments

EnvFactory is a new framework for automatically synthesizing executable training environments for tool-use AI agents. Using only 85 verified environments across 7 domains, it achieves +15% on BFCLv3 and +8.6% on MCP-Atlas — roughly 5× more efficient than comparable approaches.

🟢 🤝 Agents May 20, 2026 · 2 min read

arXiv:2605.18565: LongMINT — why AI agents forget everything you tell them

Researchers at the University of North Carolina have published LongMINT — the first benchmark that systematically measures how poorly AI agents manage memory in long, dynamic scenarios. Average accuracy is just 27.9%, worse than random guessing in many cases.

🟢 🤝 Agents May 20, 2026 · 2 min read

arXiv:2605.20173: 6 Architectural Patterns for Production LLM Agents

A new arXiv paper introduces the stochastic-deterministic boundary as the foundational design principle for production LLM agents and defines 6 composable runtime patterns — from hierarchical delegation to human-in-the-loop — selected according to three architectural concerns: coordination, state, and control.

🔴 🤝 Agents May 19, 2026 · 3 min read

Anthropic: Acquiring Stainless integrates MCP server tooling and SDK development directly into the Claude platform

On May 18, 2026, Anthropic acquired Stainless, a company founded in 2022 that is behind all official Anthropic SDKs and MCP server tooling. Stainless builds SDKs for hundreds of companies, and the acquisition aims to better integrate Claude agents with external data and tools.

🔴 🤝 Agents May 19, 2026 · 3 min read

Anthropic: MCP Tunnels, Self-Hosted Sandboxes and Automatic File-Spill for Agents

Anthropic has introduced three major updates to the Claude API platform for agent builders: MCP Tunnels for connecting to private networks without internet exposure, self-hosted sandboxes as an alternative to Anthropic infrastructure, and automatic file-spill for tool outputs exceeding 100K tokens.

🟡 🤝 Agents May 19, 2026 · 2 min read

arXiv:2605.18661: AI for Automated Research — Roadmap and User Guide

arXiv paper 2605.18661 from researchers at NUS and NTU analyzes systems that autonomously generate research papers for just $15. Key finding: frontier LLMs fabricate results and cannot reliably assess idea novelty. A comprehensive roadmap defines the boundary between reliable assistance and unsafe AI autonomy.

🟡 🤝 Agents May 19, 2026 · 3 min read

arXiv:2605.16233: FORGE — AI agents develop shared memory without fine-tuning

arXiv:2605.16233 presents FORGE, a method by which LLM agents build shared memory through population-based experience sharing — without any model weight updates. On the CybORG CAGE-2 network defense task it achieves 1.7–7.7× better performance over the zero baseline, with particularly pronounced gains for weaker models.

🟡 🤝 Agents May 19, 2026 · 2 min read

Anthropic Claude Code: v2.1.144 Brings /resume for Background Sessions and Fix for 75-Second Hang

Claude Code CLI v2.1.144 introduces /resume support for background sessions showing duration like 'Agent completed · 3h 2m 5s', fixes the 75-second hang on unavailable API, resolves an MCP tools/list pagination bug that silently lost tools, and delivers a range of terminal and MCP fixes.
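
The summary does not describe the root cause of the `tools/list` bug, so the following is only a generic illustration of cursor-based pagination as it is commonly described for MCP: a page may carry a `nextCursor`, and a client that stops after the first page never sees the remaining tools. `list_all_tools`, `fake_fetch_page`, and the sample data are hypothetical stand-ins, not Claude Code internals.

```python
def list_all_tools(fetch_page):
    """Collect tools from every page by following cursors until none is returned."""
    tools = []
    cursor = None
    while True:
        page = fetch_page(cursor)
        tools.extend(page["tools"])
        cursor = page.get("nextCursor")
        if cursor is None:
            # No further cursor: this was the last page.
            return tools


# Hypothetical stand-in for a server that splits its tool list over two pages.
PAGES = {
    None: {"tools": [{"name": "read_file"}, {"name": "write_file"}], "nextCursor": "page-2"},
    "page-2": {"tools": [{"name": "search"}]},
}


def fake_fetch_page(cursor):
    return PAGES[cursor]


all_tools = list_all_tools(fake_fetch_page)
names = [tool["name"] for tool in all_tools]
first_page_only = [tool["name"] for tool in fake_fetch_page(None)["tools"]]
```

Comparing `names` with `first_page_only` shows the failure mode: reading a single page yields two tools, while following the cursor yields all three.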

🟡 🤝 Agents May 19, 2026 · 2 min read

GitHub: Copilot CLI remote control now generally available on all platforms

GitHub announced the general availability (GA) of remote control functionality for GitHub Copilot CLI. With the /remote on command, a developer can monitor and control an active terminal session from a mobile device, web, VS Code or JetBrains IDE — without interrupting the workflow.

🟢 🤝 Agents May 19, 2026 · 3 min read

arXiv:2605.18747: Code as Operational Substrate — A New AI Agent Paradigm

41 researchers from UIUC and NVIDIA argue that code is not merely an LLM output but an agent harness — an operational substrate that unifies reasoning, action and verification into a single framework for building reliable AI systems.

🟢 🤝 Agents May 19, 2026 · 2 min read

arXiv:2605.16238: LLM-guided tree search beats CDC in epidemic forecasting

arXiv:2605.16238 presents an autonomous system combining LLMs and tree search algorithms for predicting seasonal epidemics. In real time, throughout the 2025-26 season, the system independently built models for influenza, COVID-19 and RSV that consistently matched or surpassed the CDC's gold-standard human-curated ensemble.

🟡 🤝 Agents May 18, 2026 · 4 min read

arXiv:2605.16217 Argus: evidence assembly architecture for deep research agents achieves +12.7pp with 8 parallel searchers

Editorial illustration: knowledge graph with evidence nodes and parallel searcher agents around a central navigator.

Argus is a new arXiv paper published on May 15, 2026 by Zhen Zhang, Liangcai Su, Zhuo Chen, and colleagues that presents an evidence assembly framework for deep research agents. The system uses a dual-agent architecture — Searcher (ReAct-style traces) + Navigator (shared evidence graph + RL synthesis) — achieving +5.5pp with a single Searcher, +12.7pp with 8 parallel, and a score of 86.2 on BrowseComp with 64 parallel searchers without exceeding context limits.

🟡 🤝 Agents May 18, 2026 · 4 min read

GitHub Copilot: Grok Code Fast 1 Deprecated May 15, 2026; Recommended Replacements GPT-5 mini and Claude Haiku 4.5

Editorial illustration: deprecated stamp on the xAI Grok icon with arrows toward GPT-5 mini and Claude Haiku 4.5 logos.

GitHub formally deprecated the Grok Code Fast 1 model on May 15, 2026, across all Copilot experiences (Chat, inline edits, ask, agent mode, code completions). The deprecation comes one week after the announcement on May 8. Recommended replacements: GPT-5 mini and Claude Haiku 4.5 — both available through standard model policies. Enterprise admins must enable alternatives through Copilot settings.

🟢 🤝 Agents May 18, 2026 · 4 min read

Databricks + Veeva Vault CRM: three specialized AI agents for life sciences commercial workflows

Editorial illustration: pharma sales rep with tablet and AI agent overlay with patient data dashboard.

On May 18, 2026, Databricks announced a partnership with Veeva Systems that integrates Genie AI agents directly into Vault CRM workflows for the life sciences industry. Three specialized agent personas — Sales Rep Agent, Medical Science Liaison (MSL) Agent, and Territory Manager Agent — access the Databricks lakehouse through Unity Catalog governance. The announcement precedes the Veeva Commercial Summit in Boston (May 19–20, 2026).

🟡 🤝 Agents May 16, 2026 · 3 min read

Anthropic: Claude Code v2.1.143 — 5th patch this week, plugin dependency enforcement and projected context cost in marketplace

Editorial illustration: Claude Code plugin marketplace with token cost icons and a dependency graph.

Claude Code v2.1.143, a new release of Anthropic's CLI agent, was published on May 15, 2026. It is the fifth patch this week, following v2.1.139, v2.1.140, v2.1.141, and v2.1.142. It brings plugin dependency enforcement with disable-chain hints, projected context cost display in the plugin marketplace (per-turn and per-invocation token estimates), a new worktree.bgIsolation setting, a PowerShell -ExecutionPolicy Bypass auto-flag, and background sessions that preserve model/effort through idle wake.

🟡 🤝 Agents May 16, 2026 · 4 min read

GitHub: Accessibility Agent reviewed 3,535 PRs with a 68% resolution rate, revealing LLM bias toward accessibility antipatterns

Editorial illustration: accessibility icons (screen reader, keyboard) with a GitHub PR review display.

GitHub published a case study of its Accessibility Agent, a general-purpose accessibility automation tool, on May 15, 2026. The agent reviewed 3,535 pull requests with a 68% resolution rate and uncovered a significant bias: LLMs tend to produce accessibility antipatterns because they were trained on decades of inaccessible code. GitHub uses a sequential reviewer+implementer architecture (a two-tier model) instead of parallel sub-agents, which reduced token consumption and improved accuracy.

🟢 🤝 Agents May 16, 2026 · 3 min read

arXiv:2605.14892 Survey: LIFE progression (Lay, Integrate, Find, Evolve) for LLM multi-agent systems

Editorial illustration: a multi-agent system with LIFE stages and inter-agent connections.

The LIFE progression survey is a comprehensive review of multi-agent LLM systems published May 15, 2026 on arXiv by Shihao Qi, Jie Ma, Rui Xing, Wei Guo and 14 co-authors. The survey organizes the field through four causally linked stages — Lay (individual capabilities), Integrate (agent collaboration), Find (failure attribution) and Evolve (autonomous improvement). The central thesis: error propagation across agents creates failures that rarely translate into structural self-improvement.

🟡 🤝 Agents May 15, 2026 · 2 min read

Anthropic: Claude Code v2.1.142 — Fast Mode default switches to Opus 4.7, new --add-dir and --mcp-config flags for background sessions

Editorial illustration: Claude Code terminal with background agent sessions and flag listing.

Claude Code v2.1.142, a new release of Anthropic's CLI agent, was published on May 14, 2026. It is the fourth patch this week, after v2.1.139, v2.1.140, and v2.1.141. It adds eight new flags for claude agents background sessions (--add-dir, --settings, --mcp-config, --plugin-dir, --permission-mode, --model, --effort, --dangerously-skip-permissions). The Fast Mode default is now Opus 4.7 (previously Opus 4.6). The release also fixes MCP tool timeouts, git worktree recognition, a macOS sleep daemon issue, and a Windows network drive deadlock.

🟡 🤝 Agents May 15, 2026 · 3 min read

GitHub: Copilot App in Technical Preview — Standalone GitHub-Native Desktop Agent with Isolated Sessions and Agent Merge

Editorial illustration: desktop app with git branch graphic and Agent Merge flow.

GitHub Copilot App is a new standalone GitHub-native desktop application in Technical Preview, announced on May 14, 2026. It differs from the IDE plugin in that it provides isolated sessions per task — each with its own branch, files, conversation state, and task state. Agent Merge functionality autonomously addresses review comments, fixes failing checks, and merges once conditions are met. Available to Copilot Pro/Pro+ via early access and Business/Enterprise via rollout.

🟢 🤝 Agents May 15, 2026 · 3 min read

OpenAI: Codex from Anywhere — Mobile and Web Rollout of Coding Agent with Real-Time Monitoring and Steering Controls

Editorial illustration: smartphone with Codex CLI icon and remote development stream.

OpenAI Codex from Anywhere is a new mobile and web rollout phase for the coding agent, announced on May 14, 2026. Developers can monitor, steer, and approve coding tasks in real time through the ChatGPT mobile app on smartphones and tablets. The rollout extends Codex from Windows Sandbox (May 13) and Codex CLI deployment to heterogeneous computing environments, completing OpenAI's cross-platform strategy.

🟡 🤝 Agents May 14, 2026 · 2 min read

Amazon Nova Sonic + WebRTC: real-time voice agents with Kinesis Video Streams and async tool calling for RAG/MCP

Editorial illustration: voice agent with a WebRTC flow and tool calling arrows toward cloud systems.

The Amazon Nova Sonic + WebRTC integration is a new AWS architecture, published on May 13, 2026, for real-time voice agent applications. A speech-to-speech event processor orchestrates media and text data events through Kinesis Video Streams WebRTC signaling, while server-side VAD reduces audio tokens. Nova Sonic supports async tool calling to MCP servers, Strands Agents, and RAG systems. IoT and connected-vehicle scenarios are the first demonstrations.

🟡 🤝 Agents May 14, 2026 · 2 min read

Anthropic: Claude Code v2.1.141 adds terminalSequence hook, Bedrock Haiku fix, and Summarize up to here rewind option

Editorial illustration: Claude Code terminal with new hook icons and rewind controls.

Claude Code v2.1.141 is the new Anthropic CLI agent release published on May 13, 2026. The third patch version this week adds a terminalSequence field for hook JSON output, the CLAUDE_CODE_PLUGIN_PREFER_HTTPS and ANTHROPIC_WORKSPACE_ID environment variables, claude agents --cwd path scoping, and a new Rewind menu option Summarize up to here for compressing old context. It fixes a Bedrock/Vertex Haiku model ID race and daemon status on Windows.

🟡 🤝 Agents May 14, 2026 · 2 min read

LangChain: Managed Deep Agents — hosted runtime in LangSmith with durable execution and memory layer

Editorial illustration: hosted agent runtime with memory and tool layers in a cloud environment.

Managed Deep Agents is a new LangChain hosted agent runtime published on May 13, 2026 in private beta within the LangSmith platform. The service provides durable execution, persistent memory, integrated tooling and comprehensive observability — all the infrastructure components needed for production deep agents. The agent definition stays in the repository through standard AGENTS.md and tools.json files.

🟡 🤝 Agents May 14, 2026 · 2 min read

OpenAI: Codex sandbox for Windows introduces controlled filesystem and network restrictions for autonomous agents

Editorial illustration: Codex terminal with security layers around filesystem and network access.

Codex Windows Sandbox is a new OpenAI security architecture published on May 13, 2026, enabling the Codex agent to execute safely on the Windows operating system. The sandbox introduces controlled filesystem access and network restrictions to enable safe, efficient coding agents — Codex becomes a cross-platform tool available to Windows users, not just macOS/Linux developers.

🟡 🤝 Agents May 13, 2026 · 2 min read

Anthropic: Claude Code v2.1.140 fixes /goal hang, hot-reload and Read offset validation

Editorial illustration: developer tool screen with code lines and terminal prompt symbols.

Claude Code v2.1.140 is the new Anthropic CLI agent release published on May 12, 2026, which fixes ten bugs including a silent hang in the /goal command with the disableAllHooks setting, a hot-reload regression in symlinked settings files, enterprise endpoint security startup issues, and offset parameter validation in the Read tool. Subagent type matching now accepts case-insensitive values.

🟡 🤝 Agents May 13, 2026 · 2 min read

arXiv:2605.12061 SAGE: self-evolving graph-memory engine reaches 91.6% Recall@5 on Natural Questions

Editorial illustration: dynamic graph memory with nodes and feedback arrows.

SAGE is a new self-evolving graph-memory engine for LLM agents, published on arXiv on 12 May 2026 by Juntong Wang and university collaborators. The engine runs a feedback loop between a memory writer and a memory reader (a Graph Foundation Model), through which the graph memory autonomously expands and reorganizes itself. In zero-shot open-domain retrieval it achieves 82.5/91.6 Recall@2/5 on Natural Questions, with improvements on LongMemEval and HaluMem hallucination metrics.

🟡 🤝 Agents May 13, 2026 · 2 min read

Google DeepMind: AI Pointer brings Gemini-powered mouse commands to Chrome and Googlebook

Editorial illustration: mouse cursor with glow rays integrated into a browser interface.

AI Pointer is a new experimental Google DeepMind product introduced on May 12, 2026, that integrates the Gemini model into a contextual mouse pointer. Users can point and speak a short command such as 'Fix this' or 'Compare these' without copying content into a separate application. The feature is available in Chrome immediately, while Magic Pointer is coming to the new Googlebook laptop.

🟡 🤝 Agents May 13, 2026 · 2 min read

NVIDIA: OpenShell + SAP Joule Studio bring enterprise governance to autonomous AI agents

Editorial illustration: protective layer around enterprise data flows with policy enforcement symbols.

NVIDIA OpenShell + SAP Joule Studio integration is a new enterprise agent platform announced at the SAP Sapphire conference on May 12, 2026. NVIDIA OpenShell provides an isolation runtime and policy enforcement, SAP Business AI Platform integrates it as a security layer, and Joule Studio offers an agent-building environment. The NemoClaw reference blueprint is available immediately in Joule Studio.

🟢 🤝 Agents May 13, 2026 · 2 min read

arXiv:2605.11814 MedMemoryBench reveals memory saturation in medical agents — 2,000 sessions, 16,000 turns

Editorial illustration: medical agent with memory records and streaming evaluation indicators.

MedMemoryBench is the first benchmark for memory mechanisms in personalized healthcare agents, published on arXiv on 12 May 2026. A team from Zhejiang University built approximately 2,000 sessions and 16,000 turns through a human-agent collaborative pipeline. The main finding: mainstream AI architectures show memory saturation where continuous information influx degrades performance in medical reasoning.

🟡 🤝 Agents May 12, 2026 · 3 min read

arXiv:2605.10344: TMAS — multi-agent test-time scaling sets new records on reasoning benchmarks

Editorial illustration: multiple AI agent nodes connected in collaborative network with hierarchical memory banks, glowing reasoning paths.

TMAS (Test-time Multi-Agent Scaling) is a new approach to test-time compute scaling that organizes LLM inference as a collaboration between specialized agents with hierarchical memory banks. The authors (UC Berkeley + DeepMind) report that it surpasses the existing baseline methods (Best-of-N, MCTS, AutoTTS) on MATH-500, AIME 2024, HumanEval, and GPQA Diamond under the same compute budget. It combines reasoning, retrieval, and verification in a single pipeline.

🟡 🤝 Agents May 12, 2026 · 3 min read

AWS: Strands Agents SDK + Exa integration enables agents to autonomously search the web without custom crawlers

Editorial illustration: open-source SDK agent connecting to AI-native search engine, abstract data flows representing autonomous web queries.

AWS Strands Agents SDK, an open-source framework for building autonomous AI agents, has received a deep integration with Exa, an AI-native search engine that indexes the web at the semantic level. An agent can now autonomously decide when to search the web, synthesize reports from multiple sources, and cite data, without custom crawlers or scraper infrastructure. With the integration, building a web-search-enabled agent takes about a dozen lines of code.

🟡 🤝 Agents May 12, 2026 · 2 min read

Microsoft Research: SocialReasoning-Bench reveals AI agents complete tasks but fail to defend user interests

SocialReasoning-Bench is a new Microsoft Research benchmark measuring whether an AI agent defends the user's actual interests during negotiations with other parties — not just whether it completes the task. Results show that models close deals almost perfectly but consistently leave value on the table, with 90%+ ineffective or negligent outcomes in marketplace scenarios.

🟢 🤝 Agents May 12, 2026 · 2 min read

arXiv:2605.07313: agent memory does not scale — HippoRAG loses 16–20 pp reliability as irrelevant sessions accumulate

arXiv:2605.07313 introduces a scale-conditioned evaluation protocol that tests whether agent memory systems remain functional as irrelevant data accumulates. HippoRAG loses 16–20 percentage points of budget-compliant reliability, while LiCoMemory varies depending on model size. The authors (Shao, Lu, Zhang, Luo) conclude that reliability loss is not an isolated phenomenon.
