🟡 📦 Open Source Tuesday, April 28, 2026 · 4 min read

OpenAI releases Privacy Filter: 1.5B parameters, Apache 2.0 license, 128K context, and state-of-the-art detection of eight PII categories in a single pass

Stylized depiction of a document whose sensitive sections are automatically hidden by a software filter, represented by abstract layers and category labels.

Why it matters

OpenAI has released Privacy Filter — an open-source personally identifiable information detector with 1.5 billion parameters (50M active), a 128,000-token context, and an Apache 2.0 license. It detects eight PII categories in a single pass and achieves state-of-the-art results on the PII-Masking-300k benchmark, with multilingual support.

OpenAI has released Privacy Filter — an open-source language model designed specifically for detecting personally identifiable information (PII) in text. The model is available on Hugging Face under the Apache 2.0 license, meaning developers can freely use it in commercial products without restrictions.

Technical Specifications

Privacy Filter stands out for combining several carefully chosen characteristics:

  • Model size: 1.5 billion parameters, 50M active
  • License: Apache 2.0 (permissive)
  • Context: 128,000 tokens
  • Location: openai/privacy-filter on Hugging Face

The difference between 1.5B total and 50M active parameters suggests a Mixture-of-Experts (MoE) architecture — the model behaves like a larger system in terms of capacity, but runs like a much smaller one in terms of compute cost. This matters for production scenarios requiring high-volume text processing at acceptable cost.
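A rough back-of-envelope illustration of that trade-off (the parameter figures come from the spec table above; the cost comparison itself is illustrative, not a claim from the release):

```python
# In an MoE model, per-token compute scales with the ACTIVE parameters,
# not the total capacity. Figures taken from the spec table; the exact
# speed/cost relationship depends on hardware and implementation.
total_params = 1_500_000_000   # 1.5B total capacity
active_params = 50_000_000     # 50M routed per token

active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")  # → 3.3%
```

In other words, the model stores 1.5B parameters of capacity but touches only about one-thirtieth of them per token, which is what makes high-volume processing affordable.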

Eight PII Categories in a Single Pass

The model labels text across the following eight categories:

  • private_person
  • private_address
  • private_email
  • private_phone
  • private_url
  • private_date
  • account_number
  • secret

A key advantage: a single forward pass covers the entire document up to 128K tokens, without the need for chunking and subsequent merging. This avoids the characteristic problems of PII detectors that operate over small windows — for example, recognizing that an email address mentioned in one part of a document is linked to a name mentioned 50,000 tokens earlier.
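Downstream code can then treat the result as a flat list of detections over the whole document. A minimal sketch, assuming detections arrive as {start, end, label} dicts (the shape the reference examples use) and that unknown labels should be dropped; this is illustration, not the model's actual output code:

```python
# The eight category labels listed in the model card.
PII_LABELS = {
    "private_person", "private_address", "private_email", "private_phone",
    "private_url", "private_date", "account_number", "secret",
}

def filter_spans(spans, keep_labels=PII_LABELS):
    """Keep only detections whose label is a recognized PII category."""
    return [s for s in spans if s["label"] in keep_labels]

detections = [
    {"start": 0, "end": 9, "label": "private_person"},
    {"start": 24, "end": 43, "label": "private_email"},
    {"start": 50, "end": 55, "label": "noise"},  # unrecognized label, dropped
]
print(filter_spans(detections))
```

Because the whole 128K-token document is labeled in one pass, the offsets in such a list are globally consistent; no cross-chunk merging step is needed.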

State-of-the-Art on PII-Masking-300k

Privacy Filter achieves state-of-the-art results on the PII-Masking-300k benchmark (ai4privacy dataset). The Hugging Face blog also notes that the model “works with Spanish, French, Chinese, Hindi, and other languages without modifications”, making it especially useful for global applications.

Three Web App Integration Examples

OpenAI’s Hugging Face blog includes three reference implementations, all built with Gradio and sharing the same entry point, run_privacy_filter(text):

1. Document Privacy Explorer — analysis of PDF and DOCX documents. Returns a list of spans ({start, end, label}) and PII occurrence statistics.

2. Image Anonymizer — uses OCR to extract text from images, applies Privacy Filter to the text, then maps detected spans back into pixel bounding boxes for visual redaction.

3. SmartRedact Paste — a pastebin with automatic redaction. The original text is accessible only with a reveal token, while the public version displays placeholder labels (<CATEGORY>).

All three examples are available as Spaces on Hugging Face and can be cloned for custom implementations.
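The placeholder-based redaction used by SmartRedact Paste can be sketched in a few lines. This assumes spans of the form {start, end, label} from the Document Privacy Explorer example; it is an illustration of the approach, not the Spaces' actual source:

```python
def redact(text, spans):
    """Replace each detected span with a <CATEGORY> placeholder.

    Spans are applied right-to-left so that earlier character offsets
    stay valid as the string changes length.
    """
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"<{span['label'].upper()}>"
        text = text[: span["start"]] + placeholder + text[span["end"]:]
    return text

sample = "Contact Ada Lovelace at ada@example.com for details."
spans = [
    {"start": 8, "end": 20, "label": "private_person"},
    {"start": 24, "end": 39, "label": "private_email"},
]
print(redact(sample, spans))
# → Contact <PRIVATE_PERSON> at <PRIVATE_EMAIL> for details.
```

The right-to-left ordering is the key design choice: replacing spans left-to-right would shift every subsequent offset by the length difference between the span and its placeholder.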

BIOES Decoding for Clean Boundaries

The Hugging Face blog highlights that Privacy Filter uses BIOES decoding (Begin, Inside, Outside, End, Single) to maintain clean span boundaries. This matters in practice because an incorrect span end — for example, a phone number that “spills over” into the next sentence — can cause either false detections or missed PII.
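The scheme itself is standard sequence labeling. A minimal decoder for BIOES tag sequences, using token indices in place of character offsets; this is a generic sketch of the scheme the blog names, not OpenAI's decoder:

```python
def bioes_to_spans(tags):
    """Decode a per-token BIOES tag sequence into (start, end, label) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":                 # Outside: no entity here
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":              # Single-token entity
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":            # Begin a multi-token entity
            start = i
        elif prefix == "E" and start is not None:  # End: emit the span
            spans.append((start, i + 1, label))
            start = None
        # "I" continues the current entity; malformed sequences are skipped
    return spans

tags = ["O", "B-private_phone", "I-private_phone", "E-private_phone",
        "S-private_email", "O"]
print(bioes_to_spans(tags))
# → [(1, 4, 'private_phone'), (4, 5, 'private_email')]
```

The explicit End and Single tags are what give the clean boundaries: a span is only emitted when the decoder sees a well-formed close, so a phone number cannot silently bleed into the following sentence.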

Practical Implications

An open-source PII detector of this quality under the Apache 2.0 license potentially changes the economics of compliance for a number of scenarios:

  • GDPR / DPIA processes requiring proof that PII did not cross certain processing boundaries,
  • enterprise pre-processors for logs and analytics pipelines,
  • chatbots and RAG systems that must filter input documents before sending API calls to external models,
  • media production that redacts photographs and documents before publication.

Apache 2.0 means there is no obligation to share modifications or report usage — a significant advantage over some alternative PII tools that operate under more restrictive licenses.

The model is available immediately, and the three reference examples can serve as templates for custom implementations. For production use, in-house evaluation on domain-specific data is still recommended — a general benchmark is a useful signal but does not replace testing on real traffic.

🤖

This article was generated using artificial intelligence from primary sources.