🟡 📦 Open Source Tuesday, April 28, 2026 · 4 min read

OpenAI releases Privacy Filter: 1.5B parameters, Apache 2.0 license, 128K context, and state-of-the-art detection of eight PII categories in a single pass

Stylized depiction of a document whose sensitive sections are automatically hidden by a software filter, represented by abstract layers and category labels.

Why it matters

OpenAI has released Privacy Filter — an open-source personally identifiable information detector with 1.5 billion parameters (50M active), a 128,000-token context, and an Apache 2.0 license. It detects eight PII categories in a single pass and achieves state-of-the-art results on the PII-Masking-300k benchmark, with multilingual support.

OpenAI has released Privacy Filter — an open-source language model designed specifically for detecting personally identifiable information (PII) in text. The model is available on Hugging Face under the Apache 2.0 license, meaning developers can freely use it in commercial products without restrictions.

Technical Specifications

Privacy Filter stands out for combining several carefully chosen characteristics:

  • Model size: 1.5 billion parameters, 50M active
  • License: Apache 2.0 (permissive)
  • Context: 128,000 tokens
  • Location: openai/privacy-filter on Hugging Face

The difference between 1.5B total and 50M active parameters suggests a Mixture-of-Experts (MoE) architecture — the model behaves like a larger system in terms of capacity, but runs like a much smaller one in terms of compute cost. This matters for production scenarios requiring high-volume text processing at acceptable cost.
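A rough back-of-envelope illustration of that trade-off (the parameter figures come from the spec table above; the cost comparison itself is illustrative, not a claim from the release):

```python
# In an MoE model, per-token compute scales with the ACTIVE parameters,
# not the total capacity. Figures taken from the spec table; the exact
# speed/cost relationship depends on hardware and implementation.
total_params = 1_500_000_000   # 1.5B total capacity
active_params = 50_000_000     # 50M routed per token

active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")  # → 3.3%
```

In other words, the model stores 1.5B parameters of capacity but touches only about one-thirtieth of them per token, which is what makes high-volume processing affordable.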

Eight PII Categories in a Single Pass

The model labels text across the following eight categories:

  • private_person
  • private_address
  • private_email
  • private_phone
  • private_url
  • private_date
  • account_number
  • secret

A key advantage: a single forward pass covers the entire document up to 128K tokens, without the need for chunking and subsequent merging. This avoids the characteristic problems of PII detectors that operate over small windows — for example, recognizing that an email address mentioned in one part of a document is linked to a name mentioned 50,000 tokens earlier.
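Downstream code can then treat the result as a flat list of detections over the whole document. A minimal sketch, assuming detections arrive as {start, end, label} dicts (the shape the reference examples use) and that unknown labels should be dropped; this is illustration, not the model's actual output code:

```python
# The eight category labels listed in the model card.
PII_LABELS = {
    "private_person", "private_address", "private_email", "private_phone",
    "private_url", "private_date", "account_number", "secret",
}

def filter_spans(spans, keep_labels=PII_LABELS):
    """Keep only detections whose label is a recognized PII category."""
    return [s for s in spans if s["label"] in keep_labels]

detections = [
    {"start": 0, "end": 9, "label": "private_person"},
    {"start": 24, "end": 43, "label": "private_email"},
    {"start": 50, "end": 55, "label": "noise"},  # unrecognized label, dropped
]
print(filter_spans(detections))
```

Because the whole 128K-token document is labeled in one pass, the offsets in such a list are globally consistent; no cross-chunk merging step is needed.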

State-of-the-Art on PII-Masking-300k

Privacy Filter achieves state-of-the-art results on the PII-Masking-300k benchmark (ai4privacy dataset). The Hugging Face blog also notes that the model “works with Spanish, French, Chinese, Hindi, and other languages without modifications”, making it especially useful for global applications.

Three Web App Integration Examples

OpenAI’s Hugging Face blog includes three reference implementations, all built with Gradio and sharing the same entry point, run_privacy_filter(text):

1. Document Privacy Explorer — analysis of PDF and DOCX documents. Returns a list of spans ({start, end, label}) and PII occurrence statistics.

2. Image Anonymizer — uses OCR to extract text from images, applies Privacy Filter to the text, then maps detected spans back into pixel bounding boxes for visual redaction.

3. SmartRedact Paste — a pastebin with automatic redaction. The original text is accessible only with a reveal token, while the public version displays placeholder labels (<CATEGORY>).

All three examples are available as Spaces on Hugging Face and can be cloned for custom implementations.
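The placeholder-based redaction used by SmartRedact Paste can be sketched in a few lines. This assumes spans of the form {start, end, label} from the Document Privacy Explorer example; it is an illustration of the approach, not the Spaces' actual source:

```python
def redact(text, spans):
    """Replace each detected span with a <CATEGORY> placeholder.

    Spans are applied right-to-left so that earlier character offsets
    stay valid as the string changes length.
    """
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"<{span['label'].upper()}>"
        text = text[: span["start"]] + placeholder + text[span["end"]:]
    return text

sample = "Contact Ada Lovelace at ada@example.com for details."
spans = [
    {"start": 8, "end": 20, "label": "private_person"},
    {"start": 24, "end": 39, "label": "private_email"},
]
print(redact(sample, spans))
# → Contact <PRIVATE_PERSON> at <PRIVATE_EMAIL> for details.
```

The right-to-left ordering is the key design choice: replacing spans left-to-right would shift every subsequent offset by the length difference between the span and its placeholder.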

BIOES Decoding for Clean Boundaries

The Hugging Face blog highlights that Privacy Filter uses BIOES decoding (Begin, Inside, Outside, End, Single) to maintain clean span boundaries. This matters in practice because an incorrect span end — for example, a phone number that “spills over” into the next sentence — can cause either false detections or missed PII.
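The scheme itself is standard sequence labeling. A minimal decoder for BIOES tag sequences, using token indices in place of character offsets; this is a generic sketch of the scheme the blog names, not OpenAI's decoder:

```python
def bioes_to_spans(tags):
    """Decode a per-token BIOES tag sequence into (start, end, label) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "O":                 # Outside: no entity here
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":              # Single-token entity
            spans.append((i, i + 1, label))
            start = None
        elif prefix == "B":            # Begin a multi-token entity
            start = i
        elif prefix == "E" and start is not None:  # End: emit the span
            spans.append((start, i + 1, label))
            start = None
        # "I" continues the current entity; malformed sequences are skipped
    return spans

tags = ["O", "B-private_phone", "I-private_phone", "E-private_phone",
        "S-private_email", "O"]
print(bioes_to_spans(tags))
# → [(1, 4, 'private_phone'), (4, 5, 'private_email')]
```

The explicit End and Single tags are what give the clean boundaries: a span is only emitted when the decoder sees a well-formed close, so a phone number cannot silently bleed into the following sentence.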

Practical Implications

An open-source PII detector of this quality under the Apache 2.0 license potentially changes the economics of compliance for a number of scenarios:

  • GDPR / DPIA processes requiring proof that PII did not cross certain processing boundaries,
  • enterprise pre-processors for logs and analytics pipelines,
  • chatbots and RAG systems that must filter input documents before sending API calls to external models,
  • media production that redacts photographs and documents before publication.

Apache 2.0 means there is no obligation to share modifications or report usage — a significant advantage over some alternative PII tools that operate under more restrictive licenses.

The model is available immediately, and the three reference examples can serve as templates for custom implementations. For production use, in-house evaluation on domain-specific data is still recommended — a general benchmark is a useful signal but does not replace testing on real traffic.

🤖

This article was generated using artificial intelligence from primary sources.