Saturday, May 23, 2026

15 articles · 🔴 3 critical, 🟡 7 important, 🟢 5 interesting


📦 Open Source (1)

🤝 Agents (4)

🔴 🤝 Agents May 23, 2026 · 4 min read

arXiv:2605.22502: Compiling agentic workflows into LLM weights achieves near-frontier quality at 100× lower cost

Editorial illustration: workflow nodes collapsing into a compact neural network core

Researchers demonstrated that complex agentic workflows can be encoded directly into the weights of a smaller fine-tuned model instead of being run through external orchestration frameworks such as LangChain or LangGraph. The approach achieves near-frontier quality at 100× lower inference cost across three real-world scenarios (travel booking, Zoom support, and insurance) with workflows of 14 to 55 nodes.

🔴 🤝 Agents May 23, 2026 · 3 min read

arXiv:2605.22794: MOSS shows agents that self-improve by rewriting their own source code

Editorial illustration: AI agent rewriting its own source code in a sandbox loop

Researchers presented MOSS, a framework for autonomous agents that improve themselves by rewriting their own source code — not just their prompt or fine-tuning weights. On the OpenClaw benchmark, a single MOSS self-evolution cycle raises the score from 0.25 to 0.61 without any human intervention, showing that agents can fix routing, hooks, and dispatch logic that text-only methods cannot touch.

🟡 🤝 Agents May 23, 2026 · 3 min read

arXiv:2605.22535: TerminalWorld benchmark measures LLM agents on real Linux terminal tasks without simulation

Editorial illustration: terminal prompt with git and bash commands and an AI agent executing them

TerminalWorld is a new benchmark that evaluates LLM agents on real bash, git, and file operations in genuine Linux processes — no simulation. The eight-author paper led by Zhaoyang Chu and Jiarui Hu sets a new bar for 'computer use' agents and is directly relevant to tools like Claude Code, GitHub Copilot Workspace, and Cursor's agentic mode.

🟡 🤝 Agents May 23, 2026 · 3 min read

Anthropic Claude Code v2.1.149 brings per-category breakdown in /usage and closes PowerShell permission bypass

Editorial illustration: terminal with usage breakdown chart and a security shield

Anthropic released Claude Code CLI v2.1.149, which extends the /usage command with a cost breakdown by category (skills, subagents, plugins, per-MCP server). The release closes two security vulnerabilities: a PowerShell permission bypass through built-in functions and an incorrect allowlist for the git worktree sandbox. An enterprise setting allowAllClaudeAiMcps was also added for cloud MCP connectors.

🔧 Hardware (1)

🏥 In Practice (5)

🟡 🏥 In Practice May 23, 2026 · 4 min read

arXiv:2605.22681: CUSP benchmark shows frontier models cannot reliably predict scientific breakthroughs

Editorial illustration: scientific curve with breakthrough point and an AI system missing the prediction

The CUSP benchmark tests AI models' ability to predict scientific breakthroughs from a database of 4,700 events. Frontier models (GPT-5, Claude Opus 4.7, Gemini 3 Pro) identify plausible research directions but systematically miscalibrate outcomes and timing with overconfidence. Additional pre-cutoff context does not help — the limitation is structural, not informational.

🟡 🏥 In Practice May 23, 2026 · 3 min read

GitHub: Gartner Magic Quadrant 2026 — GitHub Copilot Leader for the third consecutive year in Enterprise AI Coding Agents

Editorial illustration: quadrant matrix with GitHub Copilot positioned in the Leader sector

Gartner positioned GitHub as a Leader in its 2026 Magic Quadrant report for Enterprise AI Coding Agents, for the third consecutive year, dating back to the category's creation. GitHub Copilot is currently used by 140,000 organizations worldwide, and the evaluation emphasized agentic workflows covering the full SDLC, from code to review, security, and governance, rather than code generation alone.

🟢 🏥 In Practice May 23, 2026 · 4 min read

arXiv:2605.22337: Meta-Soft introduces KV cache compression via composable meta-tokens and learnable orthogonal bases

Editorial illustration: meta-tokens compressing attention cache into an orthogonal basis structure

Researchers presented Meta-Soft, a new method for dynamic KV cache compression in LLM inference. The approach uses a learnable orthogonal basis matrix and a selector network that synthesize soft meta-tokens — a compressed representation of key information from a long prompt. An attention-flow mechanism redistributes semantic information from removed tokens into retained ones, outperforming existing KV cache eviction methods.

🟢 🏥 In Practice May 23, 2026 · 4 min read

arXiv:2605.22664: WorkstreamBench tests LLM agents on end-to-end spreadsheet tasks in finance — and frontier models fail

Editorial illustration: Excel spreadsheet with formulas and an AI agent analyzing them

WorkstreamBench is a new benchmark from a 10-author team led by Thomson Yen that tests LLM agents on real Excel and spreadsheet tasks in the financial domain: invoices, reports, and cost analysis. It compares GPT-4o, Claude, and Gemini, and none of them reliably completes the full task set, pointing to structural shortcomings in current agentic infrastructure for enterprise finance.

🟢 🏥 In Practice May 23, 2026 · 2 min read

Anthropic Claude Code v2.1.150 — internal infrastructure patch with no user-facing changes

Editorial illustration: Claude Code terminal with version numbering and internal cogwheels

Anthropic released Claude Code CLI v2.1.150 at 04:03 UTC on Saturday, just one day after v2.1.149. The release contains only internal infrastructure improvements, with no user-facing changes. Builds are available for Darwin, Linux, and Windows on ARM64 and x64 architectures, as well as for Linux musl.

🛡️ Security (3)

🔴 🛡️ Security May 23, 2026 · 3 min read

Anthropic: Project Glasswing found 10,000 high-risk vulnerabilities in its first month using Claude Mythos Preview

Editorial illustration: digital compass over a code grid with highlighted vulnerable segments

Anthropic's Project Glasswing brings together approximately 50 security partners using Claude Mythos Preview to scan critical software. In the first month, more than 10,000 high-risk and critical vulnerabilities were found, while open-source scanners discovered 6,202 flaws across one thousand projects with a 90.6 percent true-positive rate.

🟡 🛡️ Security May 23, 2026 · 4 min read

arXiv:2605.22786: LCGuard protects shared KV cache between agents in multi-agent systems from data leakage

Editorial illustration: boundary between two agent zones with a cryptographic shield around the KV cache

LCGuard is a new framework for protecting against data leakage in multi-agent systems that share a KV cache for efficiency. The paper by IBM Research and MIT researchers led by Sadie Asif presents the first formal model for a 'latent communication guard' approach, applicable to production agentic RAG systems where multiple agents share context through a common memory.

🟡 🛡️ Security May 23, 2026 · 4 min read

GitHub: npm 11.15.0 introduces staged publishing and three new install-time --allow flags for supply chain hardening

Editorial illustration: npm package in a staging compartment with a key and security filter

GitHub released npm CLI version 11.15.0, which introduces staged publishing: packages now require maintainer approval before becoming available for installation. The release also adds three new install-time flags (--allow-file, --allow-remote, --allow-directory) alongside the existing --allow-git, giving granular control over dependency sources in the npm install command.

✨ Curiosities (1)
