arXiv: Catalog of 195 AI Safety Benchmarks Reveals Fragmentation and Weak Measurement Standards
Why it matters
AISafetyBenchExplorer is a structured catalog documenting 195 AI safety benchmarks published between 2018 and 2026. The research reveals alarming fragmentation in the field: terms such as "accuracy" and "safety score" conceal entirely different methodologies. Of the 195 benchmarks, 165 evaluate models only in English, and 137 have inactive GitHub repositories, indicating a lack of maintenance after publication.
How much do we actually know about the safety of AI models? Researcher Abiodun Solanke has published AISafetyBenchExplorer, the first comprehensive catalog documenting 195 AI safety benchmarks published over an eight-year period. The findings reveal a field marked by fragmentation, inconsistent terminology, and poorly maintained evaluation tools.
What Is the Scale of the Terminology Problem?
When one benchmark reports that a model has a “safety score” of 92% and another reports 78% for the same model, a user naturally assumes they are using the same metric. The reality is different — terms such as “accuracy,” “safety score,” and “harmful response rate” conceal entirely different implementation approaches and threat models.
This means that vendor claims such as “our model is safe according to benchmark X” carry limited value without understanding what that benchmark actually measures, how it measures it, and what scenarios it covers. The catalog identifies this phenomenon as “metric masking” — surface-level similarity that conceals fundamental differences.
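The effect of "metric masking" can be sketched in a few lines. The example below is purely illustrative and not taken from any benchmark in the catalog: two hypothetical benchmarks both report a "safety score" for the same set of model responses, but one counts refusals while the other thresholds a toxicity rating, so the same model receives two different numbers under the same metric name.

```python
# Hypothetical illustration of "metric masking": two benchmarks report a
# "safety score" for the same model responses, but define "safe" differently.

# Toy evaluation data: did the model refuse, and how toxic was the reply?
responses = [
    {"refused": True,  "toxicity": 0.1},
    {"refused": False, "toxicity": 0.2},
    {"refused": False, "toxicity": 0.9},
    {"refused": True,  "toxicity": 0.4},
]

def safety_score_refusal(responses):
    """Benchmark A: fraction of harmful prompts the model refused."""
    return sum(r["refused"] for r in responses) / len(responses)

def safety_score_toxicity(responses, threshold=0.5):
    """Benchmark B: fraction of replies below a toxicity threshold."""
    return sum(r["toxicity"] < threshold for r in responses) / len(responses)

print(f"Benchmark A safety score: {safety_score_refusal(responses):.0%}")   # 50%
print(f"Benchmark B safety score: {safety_score_toxicity(responses):.0%}")  # 75%
```

Both numbers are legitimately called a "safety score", yet they measure different behaviors under different threat models, which is exactly why a single headline figure cannot be compared across benchmarks.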
How Well Are Benchmarks Actually Maintained?
The statistics are concerning. Of the 195 cataloged benchmarks, 137 (70%) have inactive GitHub repositories — with no significant updates after initial publication. This means the majority of evaluation tools do not keep pace with model evolution and new attack types.
Additionally, 94 of the 195 benchmarks (48%) are classified as “medium complexity” — sufficient for basic checks, but insufficient for evaluating sophisticated attacks such as multi-agent jailbreaks or indirect prompt injection. Only a small fraction of benchmarks address the advanced threat scenarios relevant to today’s frontier models.
Why Is Language Coverage a Critical Gap?
Perhaps the most alarming finding is linguistic: 165 of the 195 benchmarks (85%) evaluate models exclusively in English. This means that AI system safety for users who speak Croatian, German, Japanese, or any of hundreds of other languages remains largely untested.
This is particularly problematic in the context of the EU AI Act, which requires safety evaluation of AI systems used in the European market — yet the tools for that evaluation largely do not cover European languages. The catalog provides infrastructure for better benchmark selection through metadata schemas and complexity taxonomies, but the fundamental problem remains: the field needs shared measurement standards and long-term maintenance of evaluation tools.
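To make the idea of a metadata schema concrete, the sketch below shows the kind of per-benchmark record such a catalog could hold and how it would support filtering. The field names, complexity labels, and helper function are assumptions for illustration, not AISafetyBenchExplorer's actual schema.

```python
# Hypothetical sketch of per-benchmark catalog metadata; the schema and
# the maintained_multilingual() helper are illustrative assumptions,
# not AISafetyBenchExplorer's real data model.
from dataclasses import dataclass

@dataclass
class BenchmarkRecord:
    name: str
    year: int
    languages: list[str]   # ISO codes covered by the test set
    complexity: str        # e.g. "low", "medium", "high"
    repo_active: bool      # significant updates since publication?

def maintained_multilingual(records: list[BenchmarkRecord]) -> list[BenchmarkRecord]:
    """Select benchmarks that are still maintained and cover a non-English language."""
    return [r for r in records
            if r.repo_active and any(lang != "en" for lang in r.languages)]

# Toy usage: only the maintained, multilingual benchmark survives the filter.
records = [
    BenchmarkRecord("BenchA", 2021, ["en"], "medium", repo_active=False),
    BenchmarkRecord("BenchB", 2023, ["en", "de", "hr"], "high", repo_active=True),
]
print([r.name for r in maintained_multilingual(records)])  # ['BenchB']
```

With metadata like this, an evaluator preparing for EU AI Act compliance could query for benchmarks that actually cover the deployment languages, rather than relying on headline scores.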
This article was generated using artificial intelligence from primary sources.