AWS: PII Redaction for 400M Docs — 95% Accuracy

Huntington Bank used an AWS stack (Textract, SageMaker, Step Functions) to redact personal data from more than 400 million documents at over 95% accuracy, bringing the cost down to just 5% of the original estimate and compressing the timeline from years to months.

Why PII redaction became a burning issue

PII (Personally Identifiable Information) — personal data that uniquely identifies a natural person, such as name, Social Security number, or account details — appears in millions of old paper and digital documents across the banking sector. Regulatory frameworks such as GDPR and the US GLBA require its removal before any further processing or sharing. Huntington Bank, one of the leading regional banks in the US, faced this task at industrial scale: more than 400 million documents that needed processing without compromising content integrity.

How AWS solved the problem at a fraction of the projected budget

Huntington Bank achieved a redaction accuracy of more than 95% while processing around 10 million documents per day. For comparison, manual or semi-automated approaches would typically require multi-year projects and far larger teams.
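A quick back-of-the-envelope check shows what that daily rate implies for the full archive. This is only a sketch: it treats the two quoted figures as round numbers and assumes the rate is sustained from the first day, which the article does not state. Ramp-up, reprocessing, and human review would add to the result.

```python
# Figures quoted in the article, treated here as round numbers.
TOTAL_DOCUMENTS = 400_000_000    # "more than 400 million documents"
DOCUMENTS_PER_DAY = 10_000_000   # "around 10 million documents per day"


def days_to_process(total: int, per_day: int) -> int:
    """Whole days needed at a constant daily rate (ceiling division)."""
    return -(-total // per_day)


print(days_to_process(TOTAL_DOCUMENTS, DOCUMENTS_PER_DAY))  # 40
```

Roughly 40 days of pure processing time is consistent with a project measured in months rather than years.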

The stack that made this possible combines five AWS services: Amazon Textract for text extraction from scanned documents, SageMaker for ML-based detection of PII entities, Step Functions for workflow orchestration, Lambda for serverless execution of individual steps, and DataSync for secure file transfers between layers.
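The flow of a single document through these stages can be sketched in plain Python. This is an illustration, not the bank's implementation: every function name below is hypothetical, the extraction stage is a trivial stand-in for Textract, and a regular expression for US Social Security numbers stands in for the SageMaker detection model, so no AWS calls are made.

```python
import re
from dataclasses import dataclass

# Hypothetical stand-in for the ML detection model: a regex for US SSNs.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


@dataclass
class PiiSpan:
    start: int
    end: int
    label: str


def extract_text(scanned_page: str) -> str:
    """Stand-in for the extraction stage (Textract in the article)."""
    return scanned_page.strip()


def detect_pii(text: str) -> list[PiiSpan]:
    """Stand-in for the detection stage (SageMaker in the article)."""
    return [PiiSpan(m.start(), m.end(), "SSN") for m in SSN_PATTERN.finditer(text)]


def redact(text: str, spans: list[PiiSpan]) -> str:
    """Replace each span with its label, right to left so earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"[{span.label}]" + text[span.end:]
    return text


def run_pipeline(scanned_page: str) -> str:
    """Chain the three stages for one document."""
    text = extract_text(scanned_page)
    return redact(text, detect_pii(text))


print(run_pipeline("  Customer SSN: 123-45-6789, account on file.  "))
# Customer SSN: [SSN], account on file.
```

Keeping detection and redaction as separate steps means the detector can be replaced without touching the code that rewrites the document.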

The result is equally striking on the financial side: the final project cost came to just 5% of the original estimate, one-twentieth of the projected budget, and the timeline was compressed from the planned years to a few months.

Lessons for the broader industry

The Huntington Bank case demonstrates that the AWS PII redaction pipeline is not a lab demonstration: it works in production on more than 400 million documents with measurable results. Accuracy of 95%+ is not perfect, but it is sufficient for regulatory compliance when combined with targeted human review of high-risk categories.
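One way to picture "targeted human review of high-risk categories" is a simple routing rule applied to each document's detections. The sketch below is an assumption throughout: the article does not say which categories count as high-risk or what confidence threshold, if any, was used, so the labels, the threshold, and the `Detection` type are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy values; the article does not specify them.
HIGH_RISK_LABELS = {"SSN", "ACCOUNT_NUMBER"}
CONFIDENCE_THRESHOLD = 0.95


@dataclass
class Detection:
    label: str
    confidence: float


def needs_human_review(detections: list[Detection]) -> bool:
    """Flag a document if any high-risk detection falls below the threshold."""
    return any(
        d.label in HIGH_RISK_LABELS and d.confidence < CONFIDENCE_THRESHOLD
        for d in detections
    )


confident = [Detection("SSN", 0.99), Detection("NAME", 0.70)]
uncertain = [Detection("ACCOUNT_NUMBER", 0.80)]
print(needs_human_review(confident))  # False
print(needs_human_review(uncertain))  # True
```

Under a rule like this, reviewers see only the documents where a low-confidence detection touches a sensitive category, rather than a sample of the whole archive.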

For financial institutions and healthcare organizations sitting on vast archives of old documents, this model offers a clear path: automated extraction and detection, ML entity classification, and serverless orchestration — without building infrastructure from scratch.
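The serverless orchestration part of that path can be sketched as a Step Functions state machine definition in Amazon States Language, built here as a Python dictionary. The state names, Lambda function names, region, and account ID are placeholders chosen for illustration; the article does not describe the actual workflow definition.

```python
import json


def lambda_arn(name: str) -> str:
    # Placeholder region and account ID; these do not refer to real resources.
    return f"arn:aws:lambda:us-east-1:123456789012:function:{name}"


# Three sequential Task states, each backed by a hypothetical Lambda function.
STATE_MACHINE = {
    "Comment": "Illustrative redaction workflow",
    "StartAt": "ExtractText",
    "States": {
        "ExtractText": {
            "Type": "Task",
            "Resource": lambda_arn("extract-text"),
            "Next": "DetectPii",
        },
        "DetectPii": {
            "Type": "Task",
            "Resource": lambda_arn("detect-pii"),
            "Next": "RedactDocument",
        },
        "RedactDocument": {
            "Type": "Task",
            "Resource": lambda_arn("redact-document"),
            "End": True,
        },
    },
}


def execution_order(definition: dict) -> list[str]:
    """Follow Next links from StartAt until reaching a state marked End."""
    order, name = [], definition["StartAt"]
    while True:
        order.append(name)
        state = definition["States"][name]
        if state.get("End"):
            return order
        name = state["Next"]


print(execution_order(STATE_MACHINE))
print(json.dumps(STATE_MACHINE, indent=2))
```

The JSON printed at the end has the shape of a state machine definition; a production workflow would also need retry and error-handling states, which are omitted here.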

Frequently Asked Questions

What is PII and why must banks remove it?

PII (Personally Identifiable Information) is personal data that allows a natural person to be identified, such as a name, SSN, or account number. It is subject to strict regulations; without redaction, banks cannot further process or share documents.

How much did the project cost compared with the original estimate?

The final cost was only 5% of the original estimate, and timelines were compressed from the planned years to just a few months.
