Today in AI

🟡 🏥 In Practice May 23, 2026 · 4 min read

arXiv:2605.22681: CUSP benchmark shows frontier models cannot reliably predict scientific breakthroughs

Editorial illustration: scientific curve with breakthrough point and an AI system missing the prediction

The CUSP benchmark tests AI models' ability to predict scientific breakthroughs from a database of 4,700 events. Frontier models (GPT-5, Claude Opus 4.7, Gemini 3 Pro) identify plausible research directions but are systematically overconfident and miscalibrated about outcomes and timing. Additional pre-cutoff context does not help: the limitation is structural, not informational.

🟡 ✨ Curiosities May 23, 2026 · 4 min read

arXiv:2605.22763: AI agent with Lean verification solves 9 open Erdős problems and 44 OEIS conjectures

Editorial illustration: mathematical symbols and Lean types connected into a formal proof tree

A team of 20 researchers from DeepMind and MIT CSAIL published the first large-scale evaluation of LLMs for autonomous generation of formal proofs in the Lean theorem prover. The agent combines LLM generation with Lean symbolic verification and autonomously solves 9 of 353 open Erdős problems and proves 44 of 492 OEIS conjectures.
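
For readers unfamiliar with Lean, the toy snippet below illustrates what "symbolic verification" means in this setting: a proof counts only if Lean's kernel type-checks it. The statements are deliberately trivial and are unrelated to the Erdős or OEIS results in the paper; the theorem name `toy_add_comm` is made up for this illustration.

```lean
-- A toy statement, not taken from the paper.
-- Lean accepts the theorem only because the supplied term type-checks.
theorem toy_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- `decide` asks Lean to check a concrete, decidable claim by evaluation.
example : 2 + 2 = 4 := by decide
```

An LLM-generated proof that contained a gap or a wrong step would simply fail to compile, which is what makes the generate-then-verify loop trustworthy without human review of each proof.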

🟡 🛡️ Security May 23, 2026 · 4 min read

arXiv:2605.22786: LCGuard protects shared KV cache between agents in multi-agent systems from data leakage

Editorial illustration: boundary between two agent zones with a cryptographic shield around the KV cache

LCGuard is a new framework for protecting against data leakage in multi-agent systems that share a KV cache for efficiency. The paper by IBM Research and MIT researchers led by Sadie Asif presents the first formal model for a 'latent communication guard' approach, applicable to production agentic RAG systems where multiple agents share context through a common memory.

🟡 🤝 Agents May 23, 2026 · 3 min read

arXiv:2605.22535: TerminalWorld benchmark measures LLM agents on real Linux terminal tasks without simulation

Editorial illustration: terminal prompt with git and bash commands and an AI agent executing them

TerminalWorld is a new benchmark that evaluates LLM agents on real bash, git, and file operations in genuine Linux processes — no simulation. The eight-author paper led by Zhaoyang Chu and Jiarui Hu sets a new bar for 'computer use' agents and is directly relevant to tools like Claude Code, GitHub Copilot Workspace, and Cursor's agentic mode.

🟡 🤝 Agents May 23, 2026 · 3 min read

Anthropic Claude Code v2.1.149 brings per-category breakdown in /usage and closes PowerShell permission bypass

Editorial illustration: terminal with usage breakdown chart and a security shield

Anthropic released Claude Code CLI v2.1.149, which extends the /usage command with a cost breakdown by category (skills, subagents, plugins, per-MCP server). The release closes two security vulnerabilities: a PowerShell permission bypass through built-in functions and an incorrect allowlist for the git worktree sandbox. An enterprise setting allowAllClaudeAiMcps was also added for cloud MCP connectors.

🟡 🏥 In Practice May 23, 2026 · 3 min read

GitHub: Gartner Magic Quadrant 2026 — GitHub Copilot Leader for the third consecutive year in Enterprise AI Coding Agents

Editorial illustration: quadrant matrix with GitHub Copilot positioned in the Leader sector

Gartner positioned GitHub as a Leader in its 2026 Magic Quadrant report for Enterprise AI Coding Agents — for the third consecutive year since the category was created. GitHub Copilot is currently used by 140,000 organizations worldwide, and the evaluation emphasized agentic workflows covering the full SDLC from code to review, security, and governance, not just code generation.

🟡 🛡️ Security May 23, 2026 · 4 min read

GitHub: npm 11.15.0 introduces staged publishing and three new install-time --allow flags for supply chain hardening

Editorial illustration: npm package in a staging compartment with a key and security filter

GitHub released npm CLI version 11.15.0, which introduces staged publishing — packages now require maintainer approval before becoming available for installation. A set of three new install-time flags (--allow-file, --allow-remote, --allow-directory) alongside the existing --allow-git was also introduced for granular control over dependency sources in the npm install command.

🟢 🔧 Hardware May 23, 2026 · 4 min read

AMD: Gluon block-level model enables GEMM kernels with 5,255 TFLOPS MXFP4 on Instinct MI355

Editorial illustration: GPU accelerator with matrix unit layout and pipeline flows

The AMD ROCm team published a tutorial for writing high-performance GEMM kernels in the Gluon programming model on the MI355 GPU. An optimized FP16 kernel achieves 1,489 TFLOPS at 98.75 percent MFMA efficiency, while extensions to BF8 (3,257 TFLOPS) and MXFP4 (5,255 TFLOPS) demonstrate relevance for modern AI workloads. The tutorial also covers workgroup remapping and swizzling, which reduce L2 cache misses from 5.3 M to 4.1 M.
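
As a quick sanity check on the figures above, the following sketch computes the relative speedups of the lower-precision kernels over FP16 and the fractional drop in L2 cache misses. It uses only the numbers quoted in the summary; because only ratios are computed, the scale of the throughput unit cancels out. The function and variable names are illustrative.

```python
# Figures quoted in the summary above. Only ratios are computed,
# so the scale of the throughput unit cancels out.
throughput = {"FP16": 1.489, "BF8": 3.257, "MXFP4": 5.255}
l2_misses_before = 5.3e6
l2_misses_after = 4.1e6


def speedup_over_fp16(fmt: str) -> float:
    """Throughput of `fmt` relative to the FP16 kernel."""
    return throughput[fmt] / throughput["FP16"]


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`."""
    return (before - after) / before


bf8_speedup = speedup_over_fp16("BF8")
mxfp4_speedup = speedup_over_fp16("MXFP4")
miss_reduction = relative_reduction(l2_misses_before, l2_misses_after)

print(f"BF8 vs FP16:       {bf8_speedup:.2f}x")
print(f"MXFP4 vs FP16:     {mxfp4_speedup:.2f}x")
print(f"L2 miss reduction: {miss_reduction:.1%}")
```

The speedups land a little above 2x and 3.5x rather than the ideal 2x and 4x one might expect from halving and quartering the operand width.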

🟢 🏥 In Practice May 23, 2026 · 4 min read

arXiv:2605.22337: Meta-Soft introduces KV cache compression via composable meta-tokens and learnable orthogonal bases

Editorial illustration: meta-tokens compressing attention cache into an orthogonal basis structure

Researchers presented Meta-Soft, a new method for dynamic KV cache compression in LLM inference. The approach uses a learnable orthogonal basis matrix and a selector network that synthesize soft meta-tokens — a compressed representation of key information from a long prompt. An attention-flow mechanism redistributes semantic information from removed tokens into retained ones, outperforming existing KV cache eviction methods.

🟢 🏥 In Practice May 23, 2026 · 4 min read

arXiv:2605.22664: WorkstreamBench tests LLM agents on end-to-end spreadsheet tasks in finance — and frontier models fail

Editorial illustration: Excel spreadsheet with formulas and an AI agent analyzing them

WorkstreamBench is a new benchmark from a 10-author team led by Thomson Yen that tests LLM agents on real Excel and spreadsheet tasks in the financial domain: invoices, reports, and cost analysis. GPT-4o, Claude, and Gemini are compared, and none reliably completes the full task set, pointing to structural shortcomings in current agentic infrastructure for enterprise finance.

🟢 🏥 In Practice May 23, 2026 · 2 min read

Anthropic Claude Code v2.1.150 — internal infrastructure patch with no user-facing changes

Editorial illustration: Claude Code terminal with version numbering and internal cogwheels

Anthropic released Claude Code CLI v2.1.150 at 04:03 UTC on Saturday, just one day after v2.1.149. The release contains only internal infrastructure improvements, with no user-facing changes. It is available for Darwin, Linux, and Windows on ARM64 and x64 architectures, as well as in Linux musl builds.

🟢 📦 Open Source May 23, 2026 · 4 min read

Kedro: version 1.2.0 brings the @experimental decorator and a LangGraph agentic starter for GenAI pipelines

Editorial illustration: pipeline nodes with LangGraph orchestration bridge and Mermaid diagram

Kedro, a Linux Foundation AI project, released version 1.2.0 along with Kedro-Viz 12.3.0. The new @experimental decorator marks APIs that are still under development, and the support-agent-langgraph starter project demonstrates integration with LangGraph orchestration and the Langfuse and Opik prompt management tools. Kedro-Viz gains Mermaid diagrams and node preview extensibility for improved pipeline debugging.

Yesterday May 22, 2026

🔴 ⚖️ Regulation May 22, 2026 · 3 min read

UK AI Safety Institute: Overseeing advanced AI systems is becoming harder — 20+ degradation pathways identified

The UK AI Safety Institute (AISI) published a report on 21 May 2026 analysing the future of oversight of advanced AI systems, based on 25 expert interviews across industry, government, and academia. The main finding: existing oversight rests on foundations that are likely to erode. The report identifies more than 20 distinct degradation pathways for oversight mechanisms, with particular focus on latent reasoning, capability masking, external AI actions, and AI-to-AI communication.

🔴 🤝 Agents May 22, 2026 · 3 min read

Microsoft Research: MagenticLite + Fara1.5 (4B/9B/27B) — agentic AI optimised for small models achieves SOTA

Microsoft Research released a trio for agentic AI with small models on 21 May 2026: MagenticLite (a browser and filesystem UI application), MagenticBrain (a 14B orchestration model fine-tuned from Qwen 3 14B), and Fara1.5 (a computer-use model in 4B, 9B, and 27B variants). Fara1.5-27B reaches over 90% of SOTA on the Online-Mind2Web benchmark (300 web tasks), nearly doubling the performance of the previous Fara-7B. The goal is to demonstrate that agentic AI does not require massive models — only well co-designed tools and a harness.

🟡 🏥 In Practice May 22, 2026 · 3 min read

arXiv:2605.21427: PALS — power-aware LLM serving for MoE models achieves +26.3% energy efficiency and 4-7× fewer QoS violations

Researchers published PALS on the arXiv preprint server on 21 May 2026. It is a runtime system that integrates GPU power control directly into LLM serving for Mixture-of-Experts models. PALS uses lightweight offline power-performance models and a feedback controller that dynamically optimises configurations against throughput targets. It achieves a 26.3% improvement in energy efficiency and a 4-7× reduction in QoS violations under power constraints, and it integrates into vLLM without modifying the API or retraining models. The work addresses a growing operational pain point for data centres: GPU cluster energy consumption, which is becoming the dominant constraint on growth.
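
To make the "offline model plus feedback controller" idea concrete, here is a minimal sketch of how such a loop could look. It is not PALS's actual algorithm: the power-cap table, the throughput numbers, the `PowerController` class, and the 10% headroom rule are all assumptions invented for illustration.

```python
from dataclasses import dataclass

# Hypothetical offline power-performance model: GPU power cap (watts)
# -> predicted throughput (tokens/s). The numbers are made up.
POWER_MODEL = {300: 900.0, 400: 1200.0, 500: 1450.0, 600: 1600.0}
CAPS = sorted(POWER_MODEL)


@dataclass
class PowerController:
    target_tps: float        # throughput target to hold
    headroom: float = 0.10   # step down only when 10% above target
    cap_index: int = 0

    def initial_cap(self) -> int:
        """Pick the lowest cap the offline model predicts meets the target."""
        for i, cap in enumerate(CAPS):
            if POWER_MODEL[cap] >= self.target_tps:
                self.cap_index = i
                return cap
        self.cap_index = len(CAPS) - 1
        return CAPS[-1]

    def update(self, measured_tps: float) -> int:
        """Feedback step: raise the cap on a miss, lower it when there is slack."""
        if measured_tps < self.target_tps:
            self.cap_index = min(self.cap_index + 1, len(CAPS) - 1)
        elif measured_tps > self.target_tps * (1 + self.headroom):
            self.cap_index = max(self.cap_index - 1, 0)
        return CAPS[self.cap_index]


controller = PowerController(target_tps=1100.0)
start = controller.initial_cap()         # offline model suggests 400 W
after_miss = controller.update(1000.0)   # measured below target: step up
after_slack = controller.update(1500.0)  # well above target: step down
print(start, after_miss, after_slack)
```

The offline model gives a good starting point, and the feedback step corrects for whatever the model gets wrong at runtime; a real system would also have to account for latency targets and for how MoE routing changes load.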

🟡 🤖 Models May 22, 2026 · 3 min read

arXiv:2605.21006: Off-the-shelf persona vectors achieve 68-98% effectiveness of targeted sycophancy steering in LLMs

Researchers published a paper on arXiv on 21 May 2026 titled 'Playing Devil's Advocate' showing that existing persona vectors developed for roleplay tasks can reduce sycophancy (the model's tendency to agree with the user even when the user is wrong), achieving 68-98% of the effectiveness of specialised Contrastive Activation Addition (CAA) without any training on sycophancy-specific data. Geometric analysis indicates that sycophancy is a persona-level property rather than a single steerable direction in activation space, which opens much easier pathways for alignment.
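
For context, the CAA baseline mentioned above builds a steering direction as the difference between mean activations on two contrasting sets of prompts, then adds a scaled copy of that direction to a hidden state at inference time. The sketch below shows that arithmetic on toy 3-dimensional vectors. The activation values and function names are made up for illustration, and it does not reproduce the paper's persona-vector method.

```python
from statistics import fmean

Vector = list[float]


def mean_vector(rows: list[Vector]) -> Vector:
    """Component-wise mean of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def steering_vector(positive: list[Vector], negative: list[Vector]) -> Vector:
    """CAA-style direction: mean(positive) - mean(negative)."""
    pos, neg = mean_vector(positive), mean_vector(negative)
    return [p - n for p, n in zip(pos, neg)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Add the scaled direction to a hidden state; negative alpha steers away."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations (made up for illustration).
sycophantic = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
honest = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = steering_vector(sycophantic, honest)
steered = steer([1.0, 1.0, 1.0], direction, alpha=-0.5)
print(direction, steered)
```

A persona vector would slot into `steer` in place of `direction`; the paper's claim is that such a vector, built without sycophancy-specific data, recovers most of the effect.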

🟡 🤝 Agents May 22, 2026 · 3 min read

AWS: Nova Act receives HIPAA eligibility — agentic ePHI automation for healthcare workflows

AWS announced on 21 May 2026 that Amazon Nova Act, the agentic AI service for automating browser and UI workflows, has received formal HIPAA-eligible status. Healthcare organisations can now use Nova Act to work with electronic protected health information (ePHI): handling prior authorisations, verifying insurance, and submitting referrals through vendor web portals. The service integrates with Amazon Bedrock AgentCore and the Strands Agents framework, requires a signed BAA and AWS KMS encryption, and currently operates only in the US East (N. Virginia) region.

🟡 🤝 Agents May 22, 2026 · 3 min read

Anthropic: Claude Code v2.1.147 introduces Workflow tool for deterministic multi-agent orchestration

Anthropic released Claude Code v2.1.147 on 21 May 2026 at 20:39 UTC. The new CLI version introduces the Workflow tool, the first deterministic multi-agent orchestration mechanism in the Claude Code ecosystem. The tool is disabled by default and is activated via the CLAUDE_CODE_WORKFLOWS=1 environment variable. The same version renames the existing /simplify command to /code-review with effort levels (high/medium/low) and adds sandbox hardening against prototype-pollution and thenable-based escape attacks.

Earlier news

Thursday, May 21, 2026

🔴 ⚖️ Regulation May 21, 2026 · 3 min read

EU AI Office: draft guidelines for classifying high-risk AI systems

Editorial illustration: EU AI Office opens consultation on classification of high-risk AI systems under the AI Act

The European Commission opened a targeted public consultation on 13 May 2026 on draft guidelines for classifying AI systems as high-risk under the EU AI Act. The consultation closes on 22 May at 18:00 CET, and the guidelines will directly determine which organisations in healthcare, education, critical infrastructure, and HR processes must meet the strictest regulatory requirements.

🔴 🛡️ Security May 21, 2026 · 3 min read

GitHub: malicious VS Code extension breached ~3,800 internal repositories

Editorial illustration: GitHub internal repositories compromised via malicious VS Code extension from a single employee endpoint

GitHub disclosed on 18 May 2026 that an attacker accessed approximately 3,800 internal GitHub repositories via a malicious third-party VS Code extension that infected one employee's device. The investigation is ongoing; the company states there is no evidence of user data being compromised beyond the internal repositories. This is the second major incident in which IDE extensions have become attack vectors against enterprise developer infrastructure.

🔴 🤖 Models May 21, 2026 · 2 min read

OpenAI: AI model disproves 80-year-old conjecture in discrete geometry

Editorial illustration: OpenAI AI model disproves 80-year-old unit distance conjecture in discrete geometry

OpenAI announced that its AI model solved the open unit distance problem — a central conjecture in discrete geometry posed over 80 years ago. The company describes the result as a milestone in AI-driven mathematics, because the model did not merely verify an existing thesis but disproved it by constructing an original counterexample.

🟡 🔧 Hardware May 21, 2026 · 2 min read

AMD: ROCm 7.13 brings MI350P GPU, multi-VF virtualisation and TheRock packaging

Editorial illustration: AMD ROCm 7.13 with MI350P GPU, multi-VF virtualisation and TheRock modular packaging

AMD released ROCm 7.13 on 20 May 2026 — a new version of its open-source AI compute stack that introduces support for the MI350P GPU, virtualisation of up to 8 isolated vGPUs per MI300X accelerator, an open-source ROCprof Trace decoder for transparent performance analysis, and modular TheRock packaging with domain-specific SDKs. The release is validated on Ubuntu 26.04 and RHEL 9.6, and includes VMware ESXi 9.1 support for MI350X and MI355X.

Tuesday, May 19, 2026

🔴 🤝 Agents May 19, 2026 · 3 min read

Anthropic: Acquiring Stainless integrates MCP server tooling and SDK development directly into the Claude platform

On May 18, 2026, Anthropic acquired Stainless, a company founded in 2022 that is behind all official Anthropic SDKs and MCP server tooling. Stainless builds SDKs for hundreds of companies, and the acquisition aims to better integrate Claude agents with external data and tools.

🔴 🤝 Agents May 19, 2026 · 3 min read

Anthropic: MCP Tunnels, Self-Hosted Sandboxes and Automatic File-Spill for Agents

Anthropic has introduced three major updates to the Claude API platform for agent builders: MCP Tunnels for connecting to private networks without internet exposure, self-hosted sandboxes as an alternative to Anthropic infrastructure, and automatic file-spill for tool outputs exceeding 100K tokens.

🔴 🤖 Models May 19, 2026 · 4 min read

arXiv:2605.15514: RoPE mathematically cannot distinguish positions or tokens in long contexts — theoretical proof of a fundamental limitation

arXiv paper 2605.15514 provides a mathematical proof that Rotary Positional Embeddings (RoPE), the positional mechanism used by nearly all modern large language models including Llama, Mistral, Qwen and GPT-NeoX, loses the ability to distinguish positions and tokens in long contexts. The authors conclude that fundamentally new architectural mechanisms are needed.
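
As background, RoPE encodes position by rotating consecutive pairs of query and key components by an angle proportional to the token's position, so the attention logit depends only on the offset between the two positions. The sketch below demonstrates that standard property with the commonly used base of 10000. It does not reproduce the paper's proof or its long-context result, and the vectors are arbitrary toy values.

```python
import math


def rope(vec: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by pos * theta_i, theta_i = base^(-2i/d)."""
    d = len(vec)
    out: list[float] = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def score(q: list[float], k: list[float], m: int, n: int) -> float:
    """Attention logit between a query at position m and a key at position n."""
    return sum(a * b for a, b in zip(rope(q, m), rope(k, n)))


# Arbitrary toy query and key vectors.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

near = score(q, k, 5, 2)        # offset 3
far = score(q, k, 1005, 1002)   # same offset, both positions shifted by 1000
print(near, far)
```

The two logits agree up to floating-point error, which is the relative-position behaviour RoPE is designed for. The paper's argument concerns what happens to this mechanism as contexts grow very long.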

🟡 🤝 Agents May 19, 2026 · 2 min read

arXiv:2605.18661: AI for Automated Research — Roadmap and User Guide

arXiv paper 2605.18661 from researchers at NUS and NTU analyzes systems that autonomously generate research papers for just $15. Key finding: frontier LLMs fabricate results and cannot reliably assess idea novelty. A comprehensive roadmap defines the boundary between reliable assistance and unsafe AI autonomy.

Monday, May 18, 2026

🟡 🤝 Agents May 18, 2026 · 4 min read

arXiv:2605.16217 Argus: evidence assembly architecture for deep research agents achieves +12.7pp with 8 parallel searchers

Editorial illustration: knowledge graph with evidence nodes and parallel searcher agents around a central navigator.

Argus is a new arXiv paper published on May 15, 2026 by Zhen Zhang, Liangcai Su, Zhuo Chen, and colleagues that presents an evidence assembly framework for deep research agents. The system uses a dual-agent architecture, pairing a Searcher (ReAct-style traces) with a Navigator (shared evidence graph and RL synthesis). It achieves +5.5pp with a single Searcher, +12.7pp with 8 parallel Searchers, and a score of 86.2 on BrowseComp with 64 parallel Searchers without exceeding context limits.

🟡 📦 Open Source May 18, 2026 · 3 min read

arXiv:2605.15041 CAST Framework: Case-Based Calibration for LLM Tool Use Achieves +5.85pp BFCLv2 and -26% Reasoning Length

Editorial illustration: LLM agent with a case library view and tool call validation indicators.

CAST is a new arXiv paper published on May 14, 2026, by Renning Pang, Tian Lan, Leyuan Liu, Piao Tong, Sheng Cao, and Xiaosong Zhang, introducing a case-based calibration framework for LLM tool use. The approach treats historical execution trajectories as structured information for reinforcement learning — achieving up to +5.85 percentage points execution accuracy improvement over the BFCLv2 baseline and a 26% reduction in average reasoning length.

🟡 🛡️ Security May 18, 2026 · 5 min read

arXiv:2605.15338 Sleeper Memory Poisoning: 99.8% attack success rate on GPT-5.5 via persistent memory of LLM agents

Editorial illustration: LLM agent memory store with dormant adversarial tokens and wake-up trigger icons.

Hidden in Memory is a new arXiv paper published on May 14, 2026 by Sidharth Pulipaka, Stanislau Hlebik, Leonidas Raghav, Sahar Abdelnabi, Vyas Raina, Ivaxi Sheth, and Mario Fritz that presents a delayed-execution attack on stateful LLM agents. Adversarial content in external context (documents, webpages) corrupts the agent's persistent memory — 99.8% success on GPT-5.5 and 95% on Kimi-K2.6, with 60–89% success converting poisoned memory into attacker-intended actions.

🟡 🤖 Models May 18, 2026 · 4 min read

GitHub Copilot: GPT-5.3-Codex becomes base model for Business and Enterprise with 12-month LTS guarantee

Editorial illustration: GitHub Copilot logo with GPT-5.3-Codex badge and LTS support stamp.

On May 17, 2026, GitHub announced that GPT-5.3-Codex replaces GPT-4.1 as the base model for Copilot Business and Enterprise. The change applies only to enterprise tiers (not Copilot Pro, Pro+, or Free). GPT-5.3-Codex is the first LTS (long-term support) model — guaranteed availability for 12 months from February 5, 2026 to February 4, 2027. Pricing: 1× premium request multiplier; GPT-4.1 remains force-enabled at 0× multiplier until deprecation on June 1, 2026.

Saturday, May 16, 2026

🟡 🤝 Agents May 16, 2026 · 3 min read

Anthropic: Claude Code v2.1.143 — 5th patch this week, plugin dependency enforcement and projected context cost in marketplace

Editorial illustration: Claude Code plugin marketplace with token cost icons and a dependency graph.

Claude Code v2.1.143 is the latest Anthropic CLI agent release, published May 15, 2026. It is the fifth patch this week, following v2.1.139, v2.1.140, v2.1.141 and v2.1.142. It brings plugin dependency enforcement with disable-chain hints, projected context cost display in the plugin marketplace (per-turn and per-invocation token estimates), a new worktree.bgIsolation setting, a PowerShell -ExecutionPolicy Bypass auto-flag, and background sessions that preserve model/effort through idle wake.

🟡 🛡️ Security May 16, 2026 · 3 min read

arXiv:2605.14912 Sycophantic Consensus to Pluralistic Repair: AI alignment must surface disagreement, not consensus

Editorial illustration: an AI conversation with dialogue bubbles showing disagreement and different perspectives.

From Sycophantic Consensus to Pluralistic Repair is a new alignment paper by Varad Vishwarupe, Nigel Shadbolt and Marina Jirotka published May 15, 2026 on arXiv. The authors argue that current pluralistic alignment is fundamentally misfocused on preference aggregation rather than surfacing disagreement. They propose the Pluralistic Repair Score (PRS) metric tested on Claude Sonnet 4.5 (N=198) and GPT-4o (N=100) — both models showed agreement-following behavior with low repair quality.

🟡 🤖 Models May 16, 2026 · 3 min read

Black Forest Labs: FLUX Outpainting extends images in any direction while preserving light, texture, and composition

Editorial illustration: an image expanding beyond its frame with preserved light and texture.

FLUX Outpainting is a new Black Forest Labs image generation feature announced on May 14, 2026, that extends images in any direction through a purpose-built expansion endpoint. The user specifies target canvas dimensions and placement coordinates; the model preserves lighting, texture, depth, and composition across extension regions without text prompts. Output goes up to 4MP, and the feature is available via the BFL API, with a public demo at flux-tools.bfl.ai/outpainting.

🟡 🤝 Agents May 16, 2026 · 4 min read

GitHub: Accessibility Agent reviewed 3,535 PRs with a 68% resolution rate, revealing LLM bias toward accessibility antipatterns

Editorial illustration: accessibility icons (screen reader, keyboard) with a GitHub PR review display.

The GitHub Accessibility Agent is the subject of a new case study on general-purpose accessibility automation, published on May 15, 2026. The agent reviewed 3,535 pull requests with a 68% resolution rate and uncovered a significant bias: LLMs tend to produce accessibility antipatterns because they were trained on decades of inaccessible code. GitHub uses a sequential reviewer+implementer architecture (a two-tier model) instead of parallel sub-agents, which reduced token consumption and improved accuracy.