Tuesday, May 12, 2026

14 articles: 🟡 11 important, 🟢 3 interesting


🤖 Models (2)

🤝 Agents (4)

🟡 🤝 Agents May 12, 2026 · 3 min read

arXiv:2605.10344: TMAS — multi-agent test-time scaling sets new records on reasoning benchmarks


TMAS (Test-time Multi-Agent Scaling) is a new approach to test-time compute scaling that organizes LLM inference as a collaboration between specialized agents with hierarchical memory banks, combining reasoning, retrieval, and verification in a single pipeline. The authors (UC Berkeley and DeepMind) report that it outperforms all existing baseline methods (Best-of-N, MCTS, AutoTTS) on MATH-500, AIME 2024, HumanEval, and GPQA Diamond at the same compute budget.
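For context on what a test-time scaling baseline looks like, here is a minimal sketch of Best-of-N, one of the baselines named above. It is not TMAS and not the authors' code: the `generate` and `score` callables, and the toy stand-ins used to exercise them, are hypothetical placeholders for a real model and verifier.

```python
from typing import Callable, List, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str, int], str],
              score: Callable[[str, str], float],
              n: int = 4) -> Tuple[str, float]:
    """Sample n candidate answers and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Spend the compute budget on n independent samples.
    candidates: List[str] = [generate(prompt, seed) for seed in range(n)]
    # Let the verifier rank them and keep the best.
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score


# Hypothetical stand-ins: a "model" that guesses 40, 41, 42, ... and a
# "verifier" that prefers answers close to 42.
def toy_generate(prompt: str, seed: int) -> str:
    return str(40 + seed)


def toy_score(prompt: str, answer: str) -> float:
    return -abs(int(answer) - 42)


answer, score_value = best_of_n("What is 6 * 7?", toy_generate, toy_score, n=4)
print(answer, score_value)
```

Raising `n` is the only scaling knob here, which is the kind of budget that multi-agent methods are compared against.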

🟡 🤝 Agents May 12, 2026 · 3 min read

AWS: Strands Agents SDK + Exa integration enables agents to autonomously search the web without custom crawlers


AWS Strands Agents SDK, an open-source framework for building autonomous AI agents, now integrates deeply with Exa, an AI-native search engine that indexes the web at the semantic level. An agent can decide on its own when to search the web, synthesize reports from multiple sources, and cite data, without custom crawlers or scraper infrastructure. With the integration, a web-search-enabled agent takes about a dozen lines of code.

🟡 🤝 Agents May 12, 2026 · 2 min read

Microsoft Research: SocialReasoning-Bench reveals AI agents complete tasks but fail to defend user interests


SocialReasoning-Bench is a new Microsoft Research benchmark that measures whether an AI agent defends the user's actual interests when negotiating with other parties, not just whether it completes the task. Results show that models close deals almost perfectly but consistently leave value on the table: in marketplace scenarios, more than 90% of outcomes are ineffective or negligent.

🟢 🤝 Agents May 12, 2026 · 2 min read

arXiv:2605.07313: agent memory does not scale — HippoRAG loses 16–20 pp reliability as irrelevant sessions accumulate


arXiv:2605.07313 introduces a scale-conditioned evaluation protocol that tests whether agent memory systems remain functional as irrelevant data accumulates. HippoRAG loses 16–20 percentage points of budget-compliant reliability, while LiCoMemory's behavior varies with model size. The authors (Shao, Lu, Zhang, Luo) conclude that reliability loss is not an isolated phenomenon.

🔧 Hardware (2)

🏥 In Practice (3)

🟡 🏥 In Practice May 12, 2026 · 2 min read

Anthropic: Claude Code v2.1.139 — Agent View showing all sessions + /goal command for autonomous completion


Claude Code v2.1.139 is a release of Anthropic's CLI agent that introduces two features. Agent View, in Research Preview, is a unified list of all sessions (active, blocked, completed). The /goal command drives Claude through multiple turns until a set condition is met, with a panel showing elapsed time, step count, and token consumption.
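The general pattern behind a goal-driven command can be sketched as a loop that keeps taking turns until a condition holds, while tracking the three figures mentioned above. This is an illustration only and not Claude Code's implementation: `run_until_goal`, `GoalStats`, the `max_steps` cap, and the toy turn and goal functions are all hypothetical names.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class GoalStats:
    steps: int = 0          # number of turns taken
    tokens: int = 0         # total tokens reported by the turns
    elapsed_s: float = 0.0  # wall-clock time spent in the loop
    met: bool = False       # whether the goal held when the loop stopped


def run_until_goal(take_turn: Callable[[int], int],
                   goal_met: Callable[[], bool],
                   max_steps: int = 20) -> GoalStats:
    """Call take_turn until goal_met() is true or max_steps is reached."""
    stats = GoalStats()
    start = time.monotonic()
    while stats.steps < max_steps and not goal_met():
        # Each turn returns the number of tokens it consumed.
        stats.tokens += take_turn(stats.steps)
        stats.steps += 1
    stats.elapsed_s = time.monotonic() - start
    stats.met = goal_met()
    return stats


# Toy example: the goal is three passing tests; each turn fixes one test
# and is assumed to cost 100 tokens.
state = {"passing": 0}


def toy_turn(step: int) -> int:
    state["passing"] += 1
    return 100


def toy_goal() -> bool:
    return state["passing"] >= 3


stats = run_until_goal(toy_turn, toy_goal)
print(stats.steps, stats.tokens, stats.met)
```

The `max_steps` cap is a design choice of this sketch: an unattended loop needs some bound so that an unreachable goal does not consume tokens indefinitely.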

🟡 🏥 In Practice May 12, 2026 · 3 min read

IBM: Red Hat AI Inference and OpenShift Virtualization Service announced as managed products on IBM Cloud


IBM today announced Red Hat AI Inference Service and Red Hat OpenShift Virtualization Service as managed enterprise products available on IBM Cloud. The first offers an optimized serving environment for open-source LLMs (Granite, Llama, Mistral) with automatic scaling and SLA guarantees; the second enables running VMs and containers within the same OpenShift control plane. The goal is to reduce the operational burden on enterprise teams that want open-source AI without their own Kubernetes infrastructure.

🟡 🏥 In Practice May 12, 2026 · 3 min read

OpenAI: DeployCo — new standalone organization for enterprise AI deployment announced alongside Q1 2026 results


OpenAI on Tuesday launched DeployCo (The Deployment Company), a separate organization that helps enterprises build and scale AI applications in production. The goal is to separate foundation model R&D from enterprise deployment consulting, which until now lived in the same OpenAI team and created operational tension. DeployCo offers managed deployment, custom evaluation, post-launch monitoring, and industry-specific fine-tuning.

💬 Community (2)

🛡️ Security (1)
