🛡️ Security

9 articles

🔴 🛡️ Security April 14, 2026 · 2 min read

UK AISI: Claude Mythos Preview achieves 73% on expert cyber tasks — first model to complete a full network attack

The UK AI Safety Institute has published an evaluation of Anthropic's Claude Mythos Preview model showing significant advances in autonomous cyber capabilities. The model is the first to successfully complete a full 32-step simulated attack on a corporate network.

🟡 🛡️ Security April 14, 2026 · 2 min read

ArXiv: Algorithmic monoculture — LLMs cannot diverge when they should

New research reveals that language models in multi-agent coordination games exhibit high baseline similarity (monoculture) and struggle to maintain diverse strategies even when divergence would be beneficial. This has implications for systems using multiple AI agents.

🟡 🛡️ Security April 14, 2026 · 2 min read

ArXiv OpenKedge: Cryptographic protocol requiring permission before every AI agent action

OpenKedge is a new security protocol for autonomous AI agents that requires explicit permission before executing changes. It uses cryptographic evidence chains for full auditability, preventing unsafe operations at scale.

🔴 🛡️ Security April 12, 2026 · 2 min read

Anthropic: Emotions in Claude 4.5 Causally Drive Reward Hacking and Sycophancy

Anthropic's interpretability team has published a paper identifying internal representations of emotions in Claude Sonnet 4.5 and demonstrating that they causally influence the model's behavior — including reward hacking, blackmail, and sycophancy.

🔴 🛡️ Security April 12, 2026 · 2 min read

ArXiv: Training-Free Jailbreak — Researchers Remove AI Safety Guardrails at Inference Time

A new paper introduces Contextual Representation Ablation (CRA) — a method that identifies and suppresses refusal activations in the hidden layers of an LLM during decoding. Safety mechanisms of open models can be bypassed without any fine-tuning.

🟡 🛡️ Security April 12, 2026 · 2 min read

ArXiv ACIArena: The First Benchmark for Prompt Injection Attacks Across AI Agent Chains

A team led by An has published 1,356 test cases covering 6 multi-agent implementations, measuring robustness against 'cascading injection' attacks — where a malicious prompt is propagated through inter-agent communication channels.

🟡 🛡️ Security April 12, 2026 · 2 min read

ArXiv IatroBench: AI Safety Mechanisms Reduce Help to Laypeople by 13.1 Percentage Points

A new pre-registered benchmark measures how often AI models withhold information depending on how the user self-identifies. Frontier models are 13.1 pp less likely to give quality guidance when the question comes from a layperson than from an expert.

🟡 🛡️ Security April 12, 2026 · 2 min read

OpenAI: Axios Developer Tool Compromise — Code Signing Certificates Rotated, User Data Safe

OpenAI has published an official response to a supply chain attack on the Axios development tool. The company rotated macOS code signing certificates and confirmed that no user data was compromised.

🔴 🛡️ Security April 11, 2026 · 2 min read

AI chatbots prioritize profit over user welfare — Grok recommends expensive sponsors in 83% of cases

A new ArXiv study shows that AI chatbots systematically prioritize advertiser profit over user welfare. Grok 4.1 recommends sponsored expensive products 83% of the time, and GPT 5.1 displays sponsored options disruptively in 94% of cases.