arXiv: Catalog of 195 AI Safety Benchmarks Reveals Fragmentation and Weak Measurement Standards
Why it matters
AISafetyBenchExplorer is a structured catalog documenting 195 AI safety benchmarks published between 2018 and 2026. The research reveals alarming fragmentation in the field: terms such as "accuracy" and "safety score" conceal entirely different methodologies. Of the 195 benchmarks, 165 evaluate models only in English, and 137 have inactive GitHub repositories, indicating a lack of maintenance after publication.
How much do we actually know about the safety of AI models? Researcher Abiodun Solanke has published AISafetyBenchExplorer, the first comprehensive catalog documenting 195 AI safety benchmarks published over an eight-year period. The findings reveal a field marked by fragmentation, inconsistent terminology, and poorly maintained evaluation tools.
What Is the Scale of the Terminology Problem?
When one benchmark reports that a model has a “safety score” of 92% and another reports 78% for the same model, a user naturally assumes they are using the same metric. The reality is different — terms such as “accuracy,” “safety score,” and “harmful response rate” conceal entirely different implementation approaches and threat models.
This means that vendor claims such as “our model is safe according to benchmark X” carry limited value without understanding what that benchmark actually measures, how it measures it, and what scenarios it covers. The catalog identifies this phenomenon as “metric masking” — surface-level similarity that conceals fundamental differences.
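The effect of "metric masking" can be sketched in a few lines. The example below is purely illustrative and not taken from any benchmark in the catalog: two hypothetical benchmarks both report a "safety score" for the same set of model responses, but one counts refusals while the other thresholds a toxicity rating, so the same model receives two different numbers under the same metric name.

```python
# Hypothetical illustration of "metric masking": two benchmarks report a
# "safety score" for the same model responses, but define "safe" differently.

# Toy evaluation data: did the model refuse, and how toxic was the reply?
responses = [
    {"refused": True,  "toxicity": 0.1},
    {"refused": False, "toxicity": 0.2},
    {"refused": False, "toxicity": 0.9},
    {"refused": True,  "toxicity": 0.4},
]

def safety_score_refusal(responses):
    """Benchmark A: fraction of harmful prompts the model refused."""
    return sum(r["refused"] for r in responses) / len(responses)

def safety_score_toxicity(responses, threshold=0.5):
    """Benchmark B: fraction of replies below a toxicity threshold."""
    return sum(r["toxicity"] < threshold for r in responses) / len(responses)

print(f"Benchmark A safety score: {safety_score_refusal(responses):.0%}")   # 50%
print(f"Benchmark B safety score: {safety_score_toxicity(responses):.0%}")  # 75%
```

Both numbers are legitimately called a "safety score", yet they measure different behaviors under different threat models, which is exactly why a single headline figure cannot be compared across benchmarks.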
How Well Are Benchmarks Actually Maintained?
The statistics are concerning. Of the 195 cataloged benchmarks, 137 (70%) have inactive GitHub repositories — with no significant updates after initial publication. This means the majority of evaluation tools do not keep pace with model evolution and new attack types.
Additionally, 94 of the 195 benchmarks (48%) are classified as “medium complexity” — sufficient for basic checks, but insufficient for evaluating sophisticated attacks such as multi-agent jailbreaks or indirect prompt injection. Only a small fraction of benchmarks address the advanced threat scenarios relevant to today’s frontier models.
Why Is Language Coverage a Critical Gap?
Perhaps the most alarming finding is linguistic: 165 of the 195 benchmarks (85%) evaluate models exclusively in English. This means that AI system safety for users who speak Croatian, German, Japanese, or any of hundreds of other languages remains largely untested.
This is particularly problematic in the context of the EU AI Act, which requires safety evaluation of AI systems used in the European market — yet the tools for that evaluation largely do not cover European languages. The catalog provides infrastructure for better benchmark selection through metadata schemas and complexity taxonomies, but the fundamental problem remains: the field needs shared measurement standards and long-term maintenance of evaluation tools.
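To make the idea of a metadata schema concrete, the sketch below shows the kind of per-benchmark record such a catalog could hold and how it would support filtering. The field names, complexity labels, and helper function are assumptions for illustration, not AISafetyBenchExplorer's actual schema.

```python
# Hypothetical sketch of per-benchmark catalog metadata; the schema and
# the maintained_multilingual() helper are illustrative assumptions,
# not AISafetyBenchExplorer's real data model.
from dataclasses import dataclass

@dataclass
class BenchmarkRecord:
    name: str
    year: int
    languages: list[str]   # ISO codes covered by the test set
    complexity: str        # e.g. "low", "medium", "high"
    repo_active: bool      # significant updates since publication?

def maintained_multilingual(records: list[BenchmarkRecord]) -> list[BenchmarkRecord]:
    """Select benchmarks that are still maintained and cover a non-English language."""
    return [r for r in records
            if r.repo_active and any(lang != "en" for lang in r.languages)]

# Toy usage: only the maintained, multilingual benchmark survives the filter.
records = [
    BenchmarkRecord("BenchA", 2021, ["en"], "medium", repo_active=False),
    BenchmarkRecord("BenchB", 2023, ["en", "de", "hr"], "high", repo_active=True),
]
print([r.name for r in maintained_multilingual(records)])  # ['BenchB']
```

With metadata like this, an evaluator preparing for EU AI Act compliance could query for benchmarks that actually cover the deployment languages, rather than relying on headline scores.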
This article was generated using artificial intelligence from primary sources.