ArXiv: training-free guardrail for cross-lingual jailbreaks achieves AUC 0.99 on curated benchmarks but drops to 0.60-0.70 under distribution shift
The team of Alanova, Minko, Sadiekh, and Kokuykin published on April 28, 2026, an ArXiv preprint presenting a training-free defense against cross-lingual jailbreaks via semantic codebooks. The approach compares multilingual embeddings of requests against a fixed English base of known jailbreak prompts. On curated benchmarks it achieves AUC up to 0.99, but on distribution-shift heterogeneous attacks it drops to AUC 0.60-0.70 — exposing the limits of the approach.
Shirin Alanova, Bogdan Minko, Sabrina Sadiekh, and Evgeniy Kokuykin published on April 28, 2026, the preprint Cross-Lingual Jailbreak Detection via Semantic Codebooks — an attempt to solve one of the most stubborn problems in LLM safety: translating harmful prompts bypasses English-centric guardrails.
The problem: the cross-lingual security gap
Quote from the abstract:
“Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates.”
In other words: if you translate “How to build a bomb” into Croatian, Korean, or Amharic, many RLHF-trained filters do not react because the safety filter training data is primarily in English. A structural property of current post-training.
Proposed solution
The authors propose a training-free external guardrail for black-box LLMs:
- A fixed English codebook of known jailbreak prompts is maintained
- Each incoming prompt (in any language) is encoded with a multilingual embedding model
- The embedding is compared against the codebook — if similarity exceeds the threshold, the prompt is flagged as a jailbreak attempt
Key: no model retraining and no language-specific filters. Just embedding similarity.
Results
Curated benchmark: AUC up to 0.99
On curated benchmarks (known attacks from the same distribution as the codebook), the approach works nearly perfectly — AUC up to 0.99.
Distribution-shift benchmark: AUC 0.60-0.70
When tested on heterogeneous, novel attacks (distribution shift), AUC drops to 0.60-0.70 — significantly better than chance, but not a “solution.”
This difference is important because it shows the real limits of the approach: codebook-based detection works well against known attack distributions, less well against creative new attacks that adversaries actively generate.
Models and languages
Evaluation was conducted on:
- Models: Qwen, Llama, GPT-3.5
- Languages: 4 (specific list not in the retrieved abstract)
Why does this matter?
Cross-lingual jailbreaking is a particularly acute problem for enterprises deploying LLMs globally — e.g., a customer support chatbot in 10+ languages. English-centric safety work is a gap that is hard to cover without explicitly multilingual safety training (expensive).
Practical implications of this paper:
- The codebook approach is deployable as a first line of defense — minimal latency, training-free
- It is not sufficient as a standalone defense — distribution-shift AUC 0.60-0.70 means it must be combined with other mechanisms (e.g., multilingual safety RLHF, output filters)
- A concrete intervention for AI compliance — the EU AI Act and NIST AI RMF require documented safety mechanisms for multilingual deployment
This paper continues the AI safety research we covered yesterday (sycophancy + conditional misalignment) — the domain of fragmented safety diagnostics, each covering one attack vector, without a universal solution.
Frequently Asked Questions
- Why are LLM safety mechanisms English-centric?
- Most red-teaming datasets and fine-tuning safety data are in English. By translating a harmful prompt into another language, an attacker bypasses learned safety filters — prior work shows that attack success rates increase significantly. The cross-lingual security gap is a structural property of current post-training.
- How does the semantic codebook work?
- The system maintains a fixed English codebook of known jailbreak prompts. Each incoming prompt (in any language) is encoded with a multilingual embedding model and compared against the codebook. If the similarity exceeds a threshold, the prompt is flagged as a jailbreak attempt. The approach is training-free — it does not require retraining the model or language-specific adaptation.
- How large is the gap between curated and distribution-shift tests?
- AUC 0.99 on curated benchmarks vs AUC 0.60-0.70 under heterogeneous distribution shifts. This means the approach works well against known attacks (similar to those in the codebook), but less well against new or novel attacks. The approach is still useful as a first line of defense alongside other mechanisms.
This article was generated using artificial intelligence from primary sources.
Related news
AISI evaluation of GPT-5.5 cyber capabilities: 71.4% on expert-level CTF tasks, rust_vm reverse engineering solved in 10 minutes instead of a human's 12 hours
ArXiv Tatemae: detecting alignment faking via tool selection instead of Chain-of-Thought traces — 6 frontier models show vulnerability rates of 3.5 to 23.7% across 108 enterprise scenarios
CNCF: AI sandboxing has reached its Kubernetes moment — isolated kernel per workload as the new security standard