Comparing John Snow Labs’ Medical Text De-identification with Microsoft Presidio

17.06.2025

David Talby

Chief technology officer at John Snow Labs

As healthcare organizations increasingly rely on unstructured data like clinical notes, pathology reports, and discharge summaries, de-identifying patient information becomes mission-critical. Whether for research, AI training, or compliance, healthcare providers must ensure Protected Health Information (PHI) is removed at scale and with precision.

Two solutions often considered for this task are John Snow Labs’ Medical Text De-identification and Microsoft Presidio. While both are powerful tools for identifying and redacting sensitive data, they serve very different use cases — and their effectiveness in healthcare settings diverges sharply.

What is Microsoft Presidio?

Microsoft Presidio is an open-source tool designed for detecting and anonymizing Personally Identifiable Information (PII) in text and structured data. It comes with pre-configured recognizers for common entities like names, phone numbers, and credit card numbers, and it supports integration with various NLP backends.

Presidio’s strengths lie in its:

Language-agnostic architecture with support for different NLP engines (spaCy, transformers, etc.)
Customizability through user-defined recognizers
Ease of integration into enterprise pipelines

However, it was designed for general-purpose PII detection, and not specifically for clinical or biomedical contexts.
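
To make the idea of a pattern-based recognizer concrete, here is a minimal sketch in plain Python. It deliberately does not use Presidio’s API: the entity names, the regular expressions, and the find_entities helper are all assumptions made up for illustration, and real recognizers are considerably more robust.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    entity_type: str
    start: int
    end: int
    text: str


# Hypothetical patterns for illustration only (not taken from any library).
PATTERNS = {
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "MEDICAL_RECORD_NUMBER": re.compile(r"\bMRN[:\s]*\d{6,10}\b"),
}


def find_entities(text):
    """Return every pattern match in the text, ordered by start offset."""
    findings = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append(
                Finding(entity_type, match.start(), match.end(), match.group())
            )
    return sorted(findings, key=lambda f: f.start)


note = "Pt seen by Dr. Lee. MRN: 00482913. Callback 555-201-7788."
findings = find_entities(note)
for finding in findings:
    print(finding.entity_type, repr(finding.text))
```

On this sample note the sketch flags the record number and the phone number but says nothing about “Dr. Lee”, because no pattern describes clinician names. Closing that kind of gap with regular expressions alone is difficult, which is why clinical text tends to need context-aware models.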

What is John Snow Labs’ Medical Text De-identification?

John Snow Labs’ solution is part of its Healthcare NLP library, built specifically to handle the unique requirements of clinical text. It leverages domain-specific language models, context-aware PHI detection, and state-of-the-art NER pipelines tuned for compliance with HIPAA, GDPR, and other global healthcare regulations.

Key features include:

Over 50 PHI entity types tailored to clinical documents (e.g., “Medical Record Number,” “Device Serial Number,” “Date of Procedure”)
Support for multiple redaction policies (masking, obfuscation, dummy replacement, encryption)
State-of-the-art accuracy, even in edge cases like shorthand, typos, or OCR-transcribed notes
Seamless integration with Spark NLP for scalable, distributed processing
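
The difference between two of the redaction policies listed above, masking and obfuscation, can be sketched in a few lines of plain Python. This is not the Healthcare NLP API: the redact and surrogate_for helpers, the surrogate name pool, and the sample sentence are hypothetical, and the PHI spans are assumed to come from an upstream detector.

```python
import hashlib
import re

# Hypothetical pool of replacement names, for illustration only.
SURROGATE_NAMES = ["Alex Morgan", "Jamie Rivera", "Casey Patel", "Robin Chen"]


def surrogate_for(original):
    """Pick a surrogate deterministically so the same input maps to the same output."""
    digest = hashlib.sha256(original.encode("utf-8")).digest()
    return SURROGATE_NAMES[digest[0] % len(SURROGATE_NAMES)]


def redact(text, spans, policy):
    """Rewrite text given (start, end, label) spans.

    policy="mask" replaces every span with a <LABEL> tag.
    policy="obfuscate" replaces NAME spans with a surrogate name and,
    in this simplified sketch, still masks every other label.
    """
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        if policy == "obfuscate" and label == "NAME":
            out.append(surrogate_for(text[start:end]))
        else:
            out.append(f"<{label}>")
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


text = "John Smith was admitted on 2024-03-02. John Smith reports chest pain."
# Stand-in for a PHI detector: locate the spans with simple searches.
spans = [(m.start(), m.end(), "NAME") for m in re.finditer(r"John Smith", text)]
spans += [(m.start(), m.end(), "DATE") for m in re.finditer(r"\d{4}-\d{2}-\d{2}", text)]

masked = redact(text, spans, policy="mask")
obfuscated = redact(text, spans, policy="obfuscate")
print(masked)
print(obfuscated)
```

Masking keeps the text obviously redacted, while obfuscation keeps it readable. Choosing the surrogate deterministically means both mentions of the same patient receive the same replacement, which preserves references within a note.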

Benchmarking Accuracy: A Peer-reviewed Comparison

Accuracy is where the gap becomes clearest. In a recent peer-reviewed benchmark (arXiv:2503.20794), John Snow Labs’ de-identification system was evaluated against leading general-purpose LLMs, including those used under the hood by solutions like Presidio. Results showed that healthcare-specific models consistently outperformed general LLMs, especially in recall — a critical metric when missing PHI could mean a regulatory breach.

Highlights from the benchmark:

John Snow Labs achieved 98.6% F1-score on clinical note de-identification tasks.
General-purpose LLMs like GPT-4 lagged behind, often failing to detect domain-specific PHI such as procedure codes or rare provider names.
Presidio’s reliance on rule-based or shallow ML pipelines makes it vulnerable to contextual errors (e.g., recognizing “Dr. Lee” as a person name in general text, but failing to flag the same mention as PHI when it appears in a clinical note).
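
To see why recall gets special attention in de-identification, it helps to compute precision, recall, and F1 on a toy example. The sketch below uses made-up gold and predicted spans, not data from the benchmark, and the precision_recall_f1 helper is a hypothetical name.

```python
def precision_recall_f1(gold, predicted):
    """Compute exact-match precision, recall, and F1 over sets of PHI spans."""
    true_positives = len(gold & predicted)
    false_positives = len(predicted - gold)
    false_negatives = len(gold - predicted)

    precision = (
        true_positives / (true_positives + false_positives)
        if true_positives + false_positives
        else 0.0
    )
    recall = (
        true_positives / (true_positives + false_negatives)
        if true_positives + false_negatives
        else 0.0
    )
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Made-up (start, end, label) spans for a single note, for illustration only.
gold = {(0, 10, "NAME"), (27, 37, "DATE"), (50, 58, "MRN"), (70, 82, "PHONE")}
predicted = {(0, 10, "NAME"), (27, 37, "DATE"), (70, 82, "PHONE")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this toy case every predicted span is correct, so precision is perfect, yet one record number goes undetected and recall drops to 0.75. In a de-identification setting that missed span is leaked PHI, which is why a high precision score alone says little about compliance risk.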

Use Case Considerations

Feature	John Snow Labs	Microsoft Presidio
Healthcare-specific PHI coverage	✅ Yes	❌ No
Pre-trained clinical models	✅ Yes	❌ No
Support for OCR/handwritten text errors	✅ Yes	❌ Limited
Deployment (on-prem, air-gapped)	✅ Yes	✅ Yes
Open-source	❌ Commercial license	✅ Yes
Performance in clinical trials/research	✅ Proven in production	❌ Not validated for medical use

Final Thoughts

If your organization is handling general enterprise documents — HR forms, support emails, or customer chats — Presidio is a flexible, free, and easy-to-integrate option.

But if you’re working with medical records, clinical notes, or healthcare datasets that must meet strict regulatory standards, John Snow Labs offers a solution purpose-built for this exact need. Its superior accuracy, regulatory-grade performance, and tailored entity coverage make it the gold standard for automated de-identification in healthcare.

And unlike general-purpose tools, John Snow Labs’ models have been battle-tested in real-world settings, with peer-reviewed benchmarks to back up their claims.

Learn more about our results and download the full benchmark report here: arXiv:2503.20794


David Talby is the chief technology officer at John Snow Labs, helping healthcare and life science companies put AI to good use. David is the creator of Spark NLP, the world’s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams: in startups, for Microsoft’s Bing in the US and Europe, and scaling Amazon’s financial systems in Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.
