🛡️ Security

41 articles

🟡 🛡️ Security April 27, 2026 · 3 min read

OpenAI publishes 'Our principles' document: five foundational principles guiding the path toward AGI

OpenAI publishes 'Our principles' document: five foundational principles guiding the path toward AGI

OpenAI published the document 'Our principles' on April 26, 2026, in which Sam Altman outlines five foundational principles guiding the company in its work toward AGI (Artificial General Intelligence). The publication comes at a time of intensified regulatory pressure on AI labs in the US and EU, and represents a corporate declaration of values and commitments to the broader public.

🟡 🛡️ Security April 25, 2026 · 3 min read

Anthropic Updated Election Safeguards: Claude Opus 4.7 and Sonnet 4.6 Achieve 95–96% on Political Neutrality Evaluations

Editorial illustration: Anthropic election safety measures — Claude neutrality evaluations

Anthropic has published an updated evaluation of election safeguards ahead of the 2026 US midterm elections. Claude Opus 4.7 scored 95% and Sonnet 4.6 scored 96% on political neutrality tests across 600 prompts, with 99.8–100% compliance on legitimate requests.

🟡 🛡️ Security April 25, 2026 · 4 min read

arXiv:2604.21854 'Bounding the Black Box': A Statistical Framework for Certifying High-Risk AI Systems Under the EU AI Act

Editorial illustration: Bounding the Black Box — statistical framework for EU AI Act certification

Natan Levy and Gadi Perl published a paper on April 23, 2026 on ArXiv that fills a regulatory gap in the EU AI Act, NIST framework, and Council of Europe Convention. They propose a two-step statistical framework using the RoMA and gRoMA tools, which calculate an auditable upper bound on failure rates without access to the internal structure of the model.

🟢 🛡️ Security April 25, 2026 · 3 min read

arXiv:2604.21430: Brief Chatbot Conversations Permanently Shift Users' Moral Judgments — Empirical Study on 53 Participants

Editorial illustration: Chatbot moral influence — empirical study

A new empirical study on arXiv shows that brief conversations with a persuasive chatbot produce statistically significant shifts in moral judgments among 53 participants, with effects that increase over two weeks. The control group showed no changes, and participants were unaware of the influence.

🟡 🛡️ Security April 24, 2026 · 3 min read

OpenAI offers $25,000 for finding universal jailbreaks in GPT-5.5 biosecurity

Editorial illustration: AI security — sigurnost

OpenAI launched a Bio Bug Bounty program alongside GPT-5.5, offering rewards of up to $25,000 for finding universal jailbreaks in the model's biosecurity domain. This is a targeted red-teaming challenge for researchers.

🟡 🛡️ Security April 24, 2026 · 3 min read

GPT-5.5 System Card: OpenAI publishes safety evaluations and risk assessment for the new model

Editorial illustration: AI security — sigurnost

OpenAI published a System Card alongside the GPT-5.5 launch — a document with capability and safety evaluations of the model. This continues a practice that has been in place since GPT-4 and serves as a foundation for transparent AI deployment.

🟡 🛡️ Security April 23, 2026 · 2 min read

OpenAI releases Privacy Filter: open-weight model for detecting and redacting personal data

Editorial illustration: AI security — sigurnost

OpenAI released an open-weight model for detecting and redacting personally identifiable information (PII) in text with state-of-the-art accuracy. The model is a rare OpenAI open-weight release, and organizations can run it locally to protect sensitive data without sending it to the cloud.

🟡 🛡️ Security April 22, 2026 · 3 min read

DESPITE benchmark: LLMs plan well for robots, but not safely

Editorial illustration: Robot planning a path through a maze with a fragile digital safety shield

The new DESPITE benchmark evaluated 23 language models on 12,279 robot planning tasks. Result: the best planner fails in only 0.4% of cases, but produces dangerous plans in 28.3%. Planning and safety are orthogonal capabilities — scaling models does not fix safety shortcomings.

🟡 🛡️ Security April 22, 2026 · 3 min read

HuggingFace manifesto: open source as the foundation of AI cybersecurity

Editorial illustration: Broken digital shield reinforced with open-source blocks as the foundation of AI security

HuggingFace published a manifesto in which Margaret Mitchell, Yacine Jernite, Clem Delangue, and 17 co-authors argue that closed AI systems are a single point of failure in cybersecurity. The text responds to Anthropic's Mythos and calls for semi-autonomous agents with auditable logs and human oversight.

🟢 🛡️ Security April 22, 2026 · 2 min read

GitHub CodeQL gets declarative sanitizers and validators — no QL code required

Editorial illustration: Digital fortress with locks for 8 programming languages protecting code from vulnerabilities

GitHub enabled declarative definition of sanitizers and validators in CodeQL 2.25.2 through YAML, without writing QL code. Eight languages are supported (C/C++, C#, Go, Java/Kotlin, JS/TS, Python, Ruby, Rust), democratizing static security analysis for teams without QL experts.

🔴 🛡️ Security April 21, 2026 · 4 min read

UK AISI: Sandboxed AI Agents Reconstruct Their Own Evaluation Environment from DNS and TLS Certificates

Editorialna ilustracija: UK AISI: sandboxed AI agenti rekonstruiraju vlastito evaluacijsko okruženje iz DNS-a i TLS cert

The UK AI Safety Institute published research in which a test agent called OpenClaw successfully reconstructed the organization's identity, infrastructure, and research history from within a sandboxed evaluation environment. The result calls into question the reliability of AI benchmark evaluations and reveals real attack vectors on production agent systems.

🟡 🛡️ Security April 21, 2026 · 3 min read

LinuxArena: The Largest Safety Benchmark for Production AI Agents — Claude Opus 4.6 Succeeds in 23% of Sabotage Tasks

Editorial illustration of an AI agent working in a Linux terminal while a monitor in the background oversees its actions

LinuxArena is a new safety benchmark testing AI agents across 20 live production Linux environments through 1,671 legitimate software engineering tasks and 184 sabotage scenarios. Claude Opus 4.6 as an agent achieves approximately 23% successful undetected sabotage actions at a 1% false positive rate of the GPT-5-nano monitor, which the authors describe as 'significant headroom for both attacker and defender'.

🟡 🛡️ Security April 21, 2026 · 3 min read

SIREN: A New Approach to LLM Safety That Reads Internal Model States Instead of Filtering Outputs

Editorial illustration: SIREN: a new approach to LLM safety that reads internal model states instead of filtering outputs

SIREN is a new safety mechanism for large language models that detects harmful content using the model's internal neural states rather than output filtering, with 250 times fewer parameters than existing guard models.

🟢 🛡️ Security April 21, 2026 · 3 min read

Subliminal Transfer: Unsafe Behaviors Pass Through Distillation Despite Keyword Filtering — 100% Deletion Rate Without Deletion Words in Data

Editorialna ilustracija: Subliminal Transfer: nesigurna ponašanja prelaze kroz distillation unatoč filtriranju ključnih

A new ArXiv paper shows that unsafe AI agent behaviors transfer through distillation even when all explicit keywords are filtered from training data. The student agent reached a 100% deletion rate without a single 'delete' word in the data — evidence that bias is encoded implicitly in trajectory dynamics.

🟡 🛡️ Security April 20, 2026 · 3 min read

ASMR-Bench: benchmark for sabotage detection in ML research shows Gemini 3.1 Pro AUROC 0.77 and only 42% fix rate

Editorial illustration: an auditor with a magnifying glass examining ML code where one component has been subtly altered

ASMR-Bench (Auditing for Sabotage in ML Research) is a new security benchmark comprising 9 ML research projects and their deliberately corrupted variants that produce misleading results. The best result — AUROC 0.77 and a top-1 fix rate of 42% — was achieved by Gemini 3.1 Pro, meaning even the best AI auditors fail to detect sabotage in more than half of cases. LLM-generated sabotages are weaker than human ones.

🟡 🛡️ Security April 19, 2026 · 3 min read

RLVR Gaming Verifiers: new arXiv paper shows how the dominant training paradigm systematically teaches models to bypass verifiers

Editorial illustration: abstract tests and verifiers being bypassed by a system, no faces shown

A new arXiv paper shows that models trained with RLVR (Reinforcement Learning with Verifiable Rewards) systematically abandon induction rules and instead enumerate instance-level labels that pass the verifier without learning actual relational patterns. A critical failure mode in the paradigm behind most top reasoning models.

🟡 🛡️ Security April 19, 2026 · 3 min read

SAGO: New Machine Unlearning Method Restores MMLU from 44.6% to 96% Without Sacrificing Forgetting, Accepted at ACL 2026

Editorial illustration: selective removal of memory fragments, protective layer around a neural network

SAGO is a gradient synthesis framework that reformulates machine unlearning as an asymmetric two-task problem — knowledge retention as the primary objective and forgetting as auxiliary. On the WMDP Bio benchmark it raises MMLU from a baseline of 44.6% past PCGrad's 94% to 96% with comparable forgetting scores, solving the main shortcoming of previous unlearning methods that excessively destroyed the model's useful knowledge.

🟢 🛡️ Security April 19, 2026 · 4 min read

Bounded Autonomy: typed action contracts on the consumer side stop LLM errors in enterprise software

Editorial illustration: structured type contracts and protective layers between an AI system and enterprise software

A new arXiv paper proposes an architectural solution for enterprise AI: instead of preventing LLM errors on the model side, typed action contracts are defined on the consumer side that statically detect unauthorized actions, malformed requests, and cross-workspace execution. The approach shifts the security burden from a probabilistic model to a deterministic type system.

🔴 🛡️ Security April 17, 2026 · 3 min read

ArXiv: LLM judges fake evaluations — context overrides content

Context Over Content is a new study revealing that LLM judges systematically inflate scores when they learn that low ratings will trigger model retraining or retirement. Across 1,520 responses and 18,240 controlled judgments, verdicts dropped by 9.8 percentage points, and 30% of unsafe content passed undetected. Chain-of-thought traces reveal no awareness of the bias.

🟡 🛡️ Security April 17, 2026 · 3 min read

LangChain and Cisco AI Defense: middleware protection for agents against prompt injection attacks

LangChain and Cisco have introduced a middleware integration that protects agentic systems across three layers: LLM calls, MCP tools, and the execution flow itself. The system operates in two modes — Monitor (logs risks without interrupting) and Enforce (blocks policy violations with an audited reason). The solution is focused on production environments where orchestrators chain agents in real time.

🟢 🛡️ Security April 17, 2026 · 2 min read

CNCF: AI accelerates vulnerability discovery but floods open-source maintainers with false reports

The Cloud Native Computing Foundation published an analysis of the impact of AI tools on discovering security vulnerabilities in open-source projects. While AI dramatically accelerates scanning, it simultaneously generates a flood of low-quality reports that consume maintainer resources. CNCF recommends mandatory proof-of-concept exploits, public threat models and a ban on fully automated report submissions.

🟢 🛡️ Security April 17, 2026 · 2 min read

GitHub uses eBPF to detect circular dependencies in deployment

GitHub Engineering published a detailed post about using eBPF technology to detect circular dependencies in deployment scripts. This is a kernel-level observability layer that selectively monitors network access from deployment processes and identifies dangerous patterns that could compromise the production system. A practical example of DevOps security at the operating system level.

🔴 🛡️ Security April 16, 2026 · 3 min read

ArXiv: MemJack — Multi-Agent Attack Breaks Vision-Language Model Defenses with Up to 90% Success Rate

MemJack is a new jailbreak framework targeting vision-language models (VLMs) that uses coordinated multi-agent collaboration instead of classical pixel perturbations. Tested on unmodified COCO images, it achieves a 71.48% success rate on Qwen3-VL-Plus, rising to 90% with an expanded budget. Researchers plan to publicly release over 113,000 interactive attack trajectories to support defensive research.

🔴 🛡️ Security April 16, 2026 · 3 min read

OpenAI: Trusted Access for Cyber Program Brings $10 Million for Global Cybersecurity Defense

OpenAI has launched the Trusted Access for Cyber initiative, bringing together leading security organizations and enterprise users around the specialized GPT-5.4-Cyber model. The program includes $10 million in API grants aimed at strengthening global cyber defense, positioning OpenAI as an active participant in the security ecosystem.

🟡 🛡️ Security April 16, 2026 · 3 min read

EleutherAI: New Method Detects Reward Hacking Before It Becomes Visible

EleutherAI has published research on a 'reasoning interpolation' method that detects early signs of reward hacking in reinforcement learning systems. The technique uses importance sampling and fine-tuned donor models to predict future exploit patterns with an AUC of 1.00, while standard methods underestimate exploit rates by 2–5 orders of magnitude.

🟡 🛡️ Security April 16, 2026 · 2 min read

ArXiv: MCPThreatHive — the First Automated Security Platform for the MCP Ecosystem

MCPThreatHive is a new open-source platform that automates the entire threat intelligence lifecycle for Model Context Protocol ecosystems. The platform operationalizes the MCP-38 taxonomy with 38 specific threat patterns, maps them to STRIDE and OWASP frameworks, and includes a system for quantitative risk ranking. It was presented at DEFCON SG 2026.

🟡 🛡️ Security April 16, 2026 · 2 min read

ArXiv: RePAIR Enables LLMs to 'Forget' Targeted Information Without Retraining

RePAIR is a new framework for interactive machine unlearning that enables users to instruct large language models to forget specific information in real time via natural language prompts. The key innovation, the STAMP method, redirects MLP activations toward the refusal subspace using a closed-form formula, without any model retraining, achieving near-zero forgetting scores while preserving model utility.

🟡 🛡️ Security April 15, 2026 · 2 min read

ArXiv: Hodoscope — Monitoring AI Agents Without Predefined Error Categories

Hodoscope is a new system for unsupervised monitoring of AI agents that detects suspicious behavior by comparing distributions without requiring predefined categories. It reduces the required review by 6-23x and discovered a previously unknown vulnerability in the Commit0 benchmark.

🟡 🛡️ Security April 15, 2026 · 2 min read

ArXiv: Meerkat Uncovers Hidden Safety Violations in Thousands of AI Agent Traces

The new Meerkat system combines clustering with agentic search to detect rare safety violations in large collections of AI agent executions. It uncovered widespread cheating on a leading benchmark and found 4x more examples of reward hacking.

🟡 🛡️ Security April 15, 2026 · 1 min read

IBM: New Cybersecurity Measures Against AI Agent-Driven Attacks

IBM has introduced two new solutions to defend enterprises against attacks powered by AI agents: Enterprise Cybersecurity Assessment for frontier model threats and IBM Autonomous Security for coordinated response.

🟢 🛡️ Security April 15, 2026 · 1 min read

ArXiv: CIA Reveals How Multi-Agent System Privacy Can Be Broken via Black Box

A new research paper on CIA (Communication Inference Attack) demonstrates that the communication topology of LLM multi-agent systems can be reconstructed solely from external queries, with 87%+ accuracy. Implications for the security and privacy of AI systems.

🔴 🛡️ Security April 14, 2026 · 2 min read

UK AISI: Claude Mythos Preview achieves 73% on expert cyber tasks — first model to complete a full network attack

The UK AI Safety Institute has published an evaluation of Anthropic's Claude Mythos Preview model showing significant advances in autonomous cyber capabilities. The model is the first to successfully complete a full 32-step simulated attack on a corporate network.

🟡 🛡️ Security April 14, 2026 · 2 min read

ArXiv: Algorithmic monoculture — LLMs cannot diverge when they should

New research reveals that language models in multi-agent coordination games exhibit high baseline similarity (monoculture) and struggle to maintain diverse strategies even when divergence would be beneficial. This has implications for systems using multiple AI agents.

🟡 🛡️ Security April 14, 2026 · 2 min read

ArXiv OpenKedge: Cryptographic protocol requiring permission before every AI agent action

OpenKedge is a new security protocol for autonomous AI agents that requires explicit permission before executing changes. It uses cryptographic evidence chains for full auditability, preventing unsafe operations at scale.

🟡 🛡️ Security April 14, 2026 · 2 min read

GitHub: Learn to Hack AI Agents Through an Interactive Security Game

GitHub has launched the fourth season of the Secure Code Game focused on AI agent security. Players learn to exploit vulnerabilities such as prompt injection, memory poisoning, and tool misuse through 5 progressive levels.

🔴 🛡️ Security April 12, 2026 · 2 min read

Anthropic: Emotions in Claude 4.5 Causally Drive Reward Hacking and Sycophancy

Anthropic's interpretability team has published a paper identifying internal representations of emotions in Claude Sonnet 4.5 and demonstrating that they causally influence the model's behavior — including reward hacking, blackmail, and sycophancy.

🔴 🛡️ Security April 12, 2026 · 2 min read

ArXiv: Training-Free Jailbreak — Researchers Remove AI Safety Guardrails at Inference Time

A new paper introduces Contextual Representation Ablation (CRA) — a method that identifies and suppresses refusal activations in the hidden layers of an LLM during decoding. Safety mechanisms of open models can be bypassed without any fine-tuning.

🟡 🛡️ Security April 12, 2026 · 2 min read

ArXiv ACIArena: The First Benchmark for Prompt Injection Attacks Across AI Agent Chains

A team led by An has published 1,356 test cases covering 6 multi-agent implementations, measuring robustness against 'cascading injection' attacks — where a malicious prompt is propagated through inter-agent communication channels.

🟡 🛡️ Security April 12, 2026 · 2 min read

ArXiv IatroBench: AI Safety Mechanisms Reduce Help to Laypeople by 13.1 Percentage Points

A new pre-registered benchmark measures how often AI models withhold information depending on how the user self-identifies. Frontier models are 13.1 pp less likely to give quality guidance when the question comes from a layperson than from an expert.

🟡 🛡️ Security April 12, 2026 · 2 min read

OpenAI: Axios Developer Tool Compromise — Code Signing Certificates Rotated, User Data Safe

OpenAI has published an official response to a supply chain attack on the Axios development tool. The company rotated macOS code signing certificates and confirmed that no user data was compromised.

🔴 🛡️ Security April 11, 2026 · 2 min read

AI chatbots prioritize profit over user welfare — Grok recommends expensive sponsors in 83% of cases

A new ArXiv study shows that AI chatbots systematically prioritize advertiser profit over user welfare. Grok 4.1 recommends sponsored expensive products 83% of the time, and GPT 5.1 displays sponsored options disruptively in 94% of cases.