
    What Structured NLP Does That LLMs Still Can’t: Precision Extraction at Billion-Document Scale

    Large language models generate fluent clinical summaries and answer medical questions impressively. But when healthcare organizations need to extract structured data from millions of clinical notes with reproducible accuracy, regulatory auditability, and terminology normalization to standard code systems, they choose specialized NLP pipelines, not generative models.

    MiBA’s experience processing oncology data, presented by Scott Newman, demonstrates why. Their AI-enhanced pipeline handles 1.4 million physician notes and approximately 1 million PDF reports, extracting entities with a 93% F1-score and relationships with an 88% F1-score. The system identifies temporal sequences (diagnosis dates, treatment timelines, disease progression), anatomical details (tumor locations, metastatic sites), and treatment-related entities (drug regimens, surgical procedures, radiation doses) with the precision that clinical trial matching and registry reporting require.

    This level of structured extraction, with deterministic outputs normalized to standard terminologies at billion-document scale, represents a capability generative LLMs cannot yet match. The distinction matters: clinical decision support systems trigger alerts based on extracted codes, quality measure calculations depend on accurate entity counts, and real-world evidence studies require precisely structured longitudinal data that free-text summaries cannot provide.

    Why structured extraction matters: what healthcare systems actually need

    Healthcare infrastructure depends on structured, coded data for operational workflows that narrative text, no matter how fluent, cannot support:

    Clinical decision support requires coded triggers: When a patient record shows extracted diagnosis code “E11.65” (Type 2 diabetes with hyperglycemia), the EHR can trigger alerts for retinal screening, foot exams, and HbA1c monitoring. A generative LLM summary stating “patient has poorly controlled diabetes” cannot trigger these structured workflows.

    Quality reporting depends on precise counts: HEDIS measures like “Comprehensive Diabetes Care” require identifying exactly which patients received specific screenings within defined timeframes. West Virginia University’s implementation extracted HCC diagnosis codes from clinical notes that structured fields alone would miss, enabling both accurate quality reporting and appropriate risk-adjusted reimbursement. The system surfaced findings through best practice alerts, demonstrating that structured extraction enables action, not just documentation.

    Real-world evidence requires normalized longitudinal data: Pharmaceutical companies building real-world evidence for drug efficacy need patient cohorts defined by precise inclusion criteria: “patients with Stage IIIB non-small cell lung cancer receiving platinum-based chemotherapy as first-line treatment.” MiBA’s pipeline, achieving 93% entity extraction and 88% relationship extraction, enables “accurate matching of patients to clinical trial enrollment criteria”, particularly when “cancer staging and biomarker findings are important inclusion criteria, as these are often missing from structured EMR data but can be recovered using NLP.”

    Interoperability depends on standard terminologies: FHIR-based data exchange requires entities normalized to SNOMED CT, ICD-10, LOINC, RxNorm. TriNetX’s approach to extracting smoking status creates “structured, harmonized labels that bring consistency across networks”, enabling research that depends on standardized coding rather than free-text variations.

    These workflows don’t need fluent summaries. They need deterministic, reproducible, auditable extraction of entities mapped to controlled vocabularies at scale.
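
    To make the coded-trigger point above concrete, here is a minimal Python sketch: a deterministic lookup from an extracted, normalized ICD-10 code to the care-gap alerts an EHR workflow could fire. The code-to-alert table is illustrative only, not a clinical rule set.

```python
# Illustrative mapping from a normalized ICD-10 code to structured care-gap alerts.
# The rule table is a toy example, not a clinical guideline.

CARE_GAP_RULES = {
    "E11.65": [  # Type 2 diabetes mellitus with hyperglycemia
        "retinal_screening_due",
        "foot_exam_due",
        "hba1c_monitoring_due",
    ],
}

def alerts_for_codes(extracted_codes):
    """Return the deterministic set of alerts implied by extracted ICD-10 codes."""
    alerts = set()
    for code in extracted_codes:
        alerts.update(CARE_GAP_RULES.get(code, []))
    return sorted(alerts)

# A free-text summary ("patient has poorly controlled diabetes") cannot drive this
# lookup; a code extracted and normalized from the note can.
print(alerts_for_codes(["E11.65"]))
```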

    Where generative LLMs struggle with structured extraction

    Generative models excel at tasks requiring synthesis, reasoning, and natural language generation. But structured data extraction at healthcare scale exposes fundamental architectural limitations:

    Challenge 1: Non-deterministic outputs undermine reproducibility

    Generative LLMs sample from probability distributions, producing variable outputs for identical inputs. A clinical note mentioning “patient denies chest pain” might be extracted as “chest pain: absent” in one run, “no chest pain reported” in another, and occasionally “chest pain: present” if the model misinterprets negation. For clinical decision support or quality reporting requiring exact entity counts, this variability is disqualifying.

    Providence Health’s de-identification of 700 million patient notes demonstrates why deterministic processing matters. Their production pipeline processes notes in batches of 100,000 to 500,000, achieving a <1% PHI leak rate across validation samples. This consistency depends on rule-based entity recognition that produces identical outputs for identical inputs, enabling audit trails showing exactly which PHI entities were detected in which documents. Generative models’ variability would make such regulatory validation impossible.
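
    The reproducibility property described above can be illustrated in a few lines of Python: a deterministic extractor returns identical output for identical input, so an audit record can pin each run to content hashes. The regex-based “extractor” below is a stand-in for illustration, not the actual de-identification pipeline.

```python
# Deterministic extraction: same input always yields the same output, so audit
# records can hash both sides. The single PHONE pattern is a toy example.
import hashlib
import json
import re

PHI_PATTERNS = {"PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b")}

def extract_phi(text: str) -> list[dict]:
    entities = []
    for label, pattern in PHI_PATTERNS.items():
        for m in pattern.finditer(text):
            entities.append({"label": label, "start": m.start(), "end": m.end()})
    return entities

note = "Call patient at 555-123-4567 to confirm follow-up."
run_1 = extract_phi(note)
run_2 = extract_phi(note)
assert run_1 == run_2  # identical inputs -> identical outputs, every time

audit_record = {
    "input_sha256": hashlib.sha256(note.encode()).hexdigest(),
    "output_sha256": hashlib.sha256(json.dumps(run_1, sort_keys=True).encode()).hexdigest(),
    "entities_found": len(run_1),
}
```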

    Challenge 2: Terminology normalization requires controlled vocabularies

    Clinical notes use hundreds of synonyms for the same concept: “MI,” “myocardial infarction,” “heart attack,” “acute coronary syndrome,” “STEMI.” Structured extraction must normalize all variants to a single standard code (ICD-10: I21.9). Generative LLMs may preserve natural language variations, use non-standard terminology, or introduce ambiguity that downstream systems cannot process.

    TriNetX’s multi-site extraction demonstrates the solution: specialized NLP with entity resolution components that map extracted terms to standard terminologies. Their system handles site-level variation in documentation practices while creating harmonized output labels, enabling consistent queries across health system networks despite terminology heterogeneity in source notes.
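
    As a minimal illustration of that normalization step, the sketch below collapses free-text variants of “myocardial infarction” to ICD-10 codes with a hard-coded synonym map; production entity resolvers draw on maintained terminology databases rather than a literal dictionary like this.

```python
# Toy terminology normalization: many surface forms, one standard code.

MI_SYNONYMS = {
    "mi": "I21.9",
    "myocardial infarction": "I21.9",
    "acute myocardial infarction": "I21.9",
    "heart attack": "I21.9",
    "stemi": "I21.0",  # ST-elevation MI maps to a more specific code
}

def normalize(term: str) -> str | None:
    """Map a surface form to an ICD-10 code, or None if unrecognized."""
    return MI_SYNONYMS.get(term.strip().lower())

assert normalize("Heart attack") == normalize("MI") == "I21.9"
assert normalize("STEMI") == "I21.0"
```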

    Challenge 3: Context detection (negation, temporality, subject) requires specialized architectures

    The clinical note statement “father had MI at age 52” contains entity “myocardial infarction” but with critical context: family history (not patient), past tense, age qualification. A medication list showing “aspirin – discontinued 2019” requires temporal parsing indicating the drug is NOT currently prescribed. Lab result “glucose 250 – reviewed, within expected range for post-meal timing” contains numerical extraction requiring clinical context interpretation.

    Systematic physician evaluation showed specialized medical NLP achieving 35% physician preference versus GPT-4o’s 24% on clinical information extraction tasks; the 11-point gap reflects precisely these context detection capabilities. Models trained on clinical text with assertion detection, relation extraction, and temporal reasoning components reliably identify negation (“no evidence of”), temporality (“history of” versus “currently”), and subject attribution (“mother has diabetes” versus “patient has diabetes”) that general-purpose models frequently miss.
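
    A deliberately simplified, rule-based sketch of these context checks follows; production assertion-detection models are learned rather than keyword-based, so this only illustrates the kinds of signals (negation cues, subject cues, temporal cues) such components capture.

```python
# Toy assertion/context detection around an already-identified entity mention.
# Real systems use trained models; these keyword lists only hint at the signals.

NEGATION_CUES = ("no evidence of", "denies", "negative for", "without")
FAMILY_CUES = ("father", "mother", "brother", "sister", "family history")
HISTORICAL_CUES = ("history of", "discontinued", " had ")

def assertion_status(sentence: str, entity: str) -> dict:
    s = sentence.lower()
    return {
        "entity": entity,
        "negated": any(cue in s for cue in NEGATION_CUES),
        "subject": "family" if any(cue in s for cue in FAMILY_CUES) else "patient",
        "temporality": "historical" if any(cue in s for cue in HISTORICAL_CUES) else "current",
    }

print(assertion_status("Patient denies chest pain.", "chest pain"))
# negated=True, subject=patient: should not be recorded as an active finding
print(assertion_status("Father had MI at age 52.", "myocardial infarction"))
# subject=family, temporality=historical: belongs in family history, not the problem list
```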

    Challenge 4: Regulatory auditability requires provenance and version control

    Healthcare data pipelines must satisfy regulatory requirements: which documents were processed when, which entities were extracted from which text spans, which model version and configuration produced outputs, how extracted data changed when pipelines updated. COTA’s regulatory-grade oncology data curation emphasizes “uncompromising quality” through systematic validation, requiring provenance tracking from source documents through extraction to final structured outputs.

    Generative LLM APIs provide limited auditability: text goes in, text comes out, with minimal intermediate logging or version control. Structured NLP pipelines maintain document-level provenance, entity-level confidence scores, rule/model attribution for each extraction, and version-controlled configurations enabling reproducible reprocessing, all essential for regulatory compliance and quality assurance.
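
    The provenance fields described above map naturally onto a per-extraction record. The sketch below shows one possible shape for such a record; the field names and version tag are illustrative assumptions, not a prescribed schema.

```python
# One possible provenance record per extracted entity: source document, text span,
# model/pipeline version, confidence, and timestamp.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class ExtractionRecord:
    document_id: str
    span_start: int
    span_end: int
    entity_label: str
    normalized_code: str
    confidence: float
    pipeline_version: str
    extracted_at: str

record = ExtractionRecord(
    document_id="note-000123",
    span_start=412,
    span_end=441,
    entity_label="DIAGNOSIS",
    normalized_code="ICD-10:E11.65",
    confidence=0.97,
    pipeline_version="ner-clinical-3.4.1",  # hypothetical version tag
    extracted_at=datetime.now(timezone.utc).isoformat(),
)
# Stored as rows in an audit table, such records answer the "which entity, which
# document, which model version" questions regulators ask.
print(asdict(record))
```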

    Challenge 5: Computational efficiency at billion-document scale

    Processing hundreds of millions of clinical notes requires efficient distributed architectures. Intermountain Health’s infrastructure processes “hundreds of millions of clinical documents on Databricks Lakehouse,” with medical text summarization reducing review time from 10 minutes to 3 minutes per document, a 70% efficiency gain. Their generative AI application enables natural language querying, but the underlying entity extraction and structuring uses specialized NLP pipelines optimized for throughput.

    Cloud-based LLM APIs with per-token pricing become prohibitively expensive at petabyte scale. The de-identification benchmark study showed specialized Healthcare NLP reducing processing costs by over 80% compared to per-request cloud API pricing through fixed-cost local deployment on distributed Spark infrastructure. For organizations processing millions of documents monthly, this cost differential determines economic feasibility.
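
    A back-of-the-envelope calculation shows why per-token pricing dominates at this scale. Every price and token count below is an assumption chosen for illustration, not a figure from the benchmark study.

```python
# Illustrative cost comparison: per-token API pricing vs. fixed cluster cost.
# All numbers are assumptions, not vendor quotes or benchmark figures.

notes_per_month = 5_000_000
tokens_per_note = 1_500                      # assumed average note length
api_price_per_1k_tokens = 0.01               # assumed cloud API price (USD)
fixed_cluster_cost_per_month = 20_000        # assumed Spark cluster cost (USD)

api_cost = notes_per_month * tokens_per_note / 1_000 * api_price_per_1k_tokens
print(f"Per-token API cost: ${api_cost:,.0f}/month")        # $75,000/month here
print(f"Fixed cluster cost: ${fixed_cluster_cost_per_month:,.0f}/month")
# The API bill grows linearly with volume; the cluster cost stays roughly flat,
# which is the dynamic behind the large savings the benchmark reports.
```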

    What specialized NLP delivers: precision, consistency, and control

    Organizations implementing structured extraction at scale consistently require capabilities that specialized NLP architectures provide:

    Deterministic entity recognition with clinical context: Healthcare NLP’s entity extraction components identify diagnoses, procedures, medications, lab results, anatomical references, and temporal expressions with assertion status (present/absent/possible), subject attribution (patient/family/other), and temporal context (current/historical/future). MiBA’s 93% entity extraction F1-score and 88% relationship extraction F1-score validate that specialized models achieve the precision clinical workflows require.

    Terminology normalization to standard code systems: Extracted entities map to controlled vocabularies, such as SNOMED CT for clinical findings, ICD-10 for diagnoses, LOINC for lab tests, RxNorm for medications, CPT for procedures. This normalization enables interoperability: FHIR resources with standardized codes that external systems can query and process. TriNetX’s harmonized labels across multiple sites demonstrate normalization working at health system network scale.

    Scalable distributed processing architecture: Apache Spark-based pipelines parallelize extraction across commodity hardware clusters. Providence’s 15-worker GPU cluster completing 100,000 notes in 44 minutes and 500,000 notes in 2.5 hours demonstrates production-ready throughput. Organizations can process entire EHR repositories, hundreds of millions of historical documents, in days rather than months, then maintain continuous extraction for incoming notes.

    Regulatory-grade auditability and provenance: Every extracted entity links back to source document ID, text span, extraction timestamp, model version, and confidence score. Ohio State University’s infrastructure processing 200+ million Epic notes includes “cohort selection, de-identification with auditability, information extraction and coding, and human-in-the-loop validation”, demonstrating audit trail capabilities regulatory compliance requires.

    Configurable validation and quality assurance workflows: Organizations implement targeted human review for high-stakes extractions, statistical sampling for accuracy monitoring, and feedback loops where corrections refine models. COTA’s approach to “uncompromising quality of regulatory grade data” combines automated extraction with validation workflows ensuring accuracy meets clinical research and registry reporting standards.
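
    As a sketch of the distributed pattern behind the throughput figures above, the PySpark snippet below reads a corpus of notes, applies an extraction function in parallel across the cluster, and persists structured rows. The regex “extractor,” storage paths, and column names are placeholders, not the actual Providence or Intermountain pipelines.

```python
# Minimal PySpark sketch: parallel extraction over a large note corpus.
import re
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.appName("note-extraction-sketch").getOrCreate()

ICD_LIKE = re.compile(r"\b[A-Z]\d{2}\.\d{1,2}\b")  # toy pattern for coded mentions

@udf(returnType=ArrayType(StringType()))
def extract_codes(text):
    return ICD_LIKE.findall(text or "")

notes = spark.read.parquet("s3://warehouse/clinical_notes/")        # assumed layout
structured = notes.withColumn("extracted_codes", extract_codes("note_text"))
structured.select("note_id", "extracted_codes") \
          .write.mode("overwrite").parquet("s3://warehouse/extracted_entities/")
```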

    Real-world workflows depending on structured NLP today

    Healthcare organizations deploy specialized NLP for use cases where precision, consistency, and scale are non-negotiable:

    Oncology registry abstraction: Cancer registries require extracting TNM staging, histology, biomarkers, treatment regimens, and response assessments normalized to standard coding systems. MiBA’s pipeline processing 1.4 million physician notes and approximately 1 million PDF reports demonstrates the scale and accuracy requirements—93% entity extraction enabling clinical trial matching that depends on precise staging and biomarker identification.

    Quality measure calculation: West Virginia University’s HCC code extraction from clinical notes demonstrates structured extraction enabling both quality reporting and risk adjustment. Their system surfaces findings through best practice alerts, showing that deterministic extraction enables closed-loop workflows, where identified gaps trigger interventions.

    De-identification for research and AI training: Providence’s processing of 700 million notes with a 0.81% PHI leak rate validates that specialized NLP achieves the precision HIPAA compliance requires. Systematic assessment showed Healthcare NLP achieving a 96% F1-score for PHI detection versus GPT-4o’s 79%; the 17-point gap defines which systems can support regulatory-grade de-identification.

    Clinical trial matching and cohort identification: MiBA’s demonstration of “accurate matching of patients to clinical trial enrollment criteria” depends on extracting cancer staging and biomarkers with precision and normalizing to terminologies trial protocols specify. Organizations building cancer data registries report similar requirements—disease response classification and sites of metastases identification requiring structured, coded outputs.

    Radiology and pathology report processing: GE Healthcare’s EDISON platform processing radiology reports demonstrates multi-modal extraction requirements: dates, imaging tests, test techniques, risk factors, body parts, measurements, and findings, all linked to identify procedure type, body part location, timing, and conclusions. The system also extracts tables from unstructured text, demonstrating layout analysis capabilities.

    Pharmacovigilance and adverse event reporting: Drug safety surveillance requires extracting medication names, dosages, adverse events, temporal relationships (drug started → event occurred), and causality assessments from clinical notes and case reports, with regulatory audit trails showing evidence chain from source documentation to safety database entries.
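
    A minimal sketch of the drug-event temporal check this use case requires is shown below; the field names, example codes, and 90-day window are illustrative assumptions, not a validated causality algorithm.

```python
# Toy drug-event temporal relation check: an adverse event is only a candidate
# signal if it starts after the drug and within an assumed surveillance window.
from dataclasses import dataclass
from datetime import date

@dataclass
class DrugExposure:
    rxnorm_code: str
    start_date: date

@dataclass
class AdverseEvent:
    meddra_term: str
    onset_date: date

def plausible_causality(drug: DrugExposure, event: AdverseEvent, window_days: int = 90) -> bool:
    """Event must begin after the drug start and within the assumed window."""
    delta = (event.onset_date - drug.start_date).days
    return 0 <= delta <= window_days

drug = DrugExposure(rxnorm_code="0000000", start_date=date(2024, 1, 10))   # placeholder code
event = AdverseEvent(meddra_term="Hepatotoxicity", onset_date=date(2024, 2, 2))
print(plausible_causality(drug, event))  # True: onset 23 days after drug start
```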

    The hybrid future: combining specialized NLP with generative capabilities

    The dichotomy isn’t “NLP versus LLMs”. It’s recognizing which architecture suits which task. Forward-looking implementations combine both:

    Structured extraction pipeline with generative enhancement: Intermountain Health’s infrastructure demonstrates the pattern: specialized NLP extracts entities and structures data, then “generative AI applications for seamless querying of the database using natural language” enable clinicians to ask questions in plain language that get translated to structured queries over the extracted data. The extraction layer ensures precision and consistency; the generative layer provides an accessible interface.

    Entity extraction feeding LLM reasoning: MiBA’s approach of “combining NLP and LLMs” shows a hybrid architecture: the NLP pipeline extracts entities and relationships with 93% and 88% F1-scores, then LLM reasoning operates over structured extractions rather than raw text, enabling more accurate clinical trial matching than either approach alone. This pattern (extract structure with NLP, reason over structure with LLMs) appears across multiple implementations.

    Quality assurance through specialized models: Organizations use a Medical LLM as an evaluation mechanism for assessing retrieval-augmented generation outputs, but the underlying retrieval depends on structured entity extraction. The specialized model scores aspects with transparent rubrics while the extraction pipeline ensures retrieved information is accurate, complete, and relevant.

    The pattern emerging: specialized NLP handles tasks requiring deterministic accuracy, controlled vocabularies, and regulatory auditability. Generative models handle tasks requiring synthesis, reasoning, and natural language interaction. Healthcare organizations building production systems combine both rather than choosing one over the other.
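
    The “extract structure with NLP, reason over structure with LLMs” pattern can be sketched as follows. The structured patient record, trial criteria, and call_llm client are hypothetical placeholders rather than MiBA’s actual implementation; the point is that the generative model receives coded facts, not raw note text.

```python
# Hybrid pattern sketch: structured extraction first, LLM reasoning second.
import json

def call_llm(prompt: str) -> str:
    """Hypothetical LLM client; replace with a real API call in practice."""
    raise NotImplementedError

structured_patient = {                    # output of an upstream NLP pipeline
    "diagnosis": {"code": "ICD-10:C34.90", "text": "non-small cell lung cancer"},
    "stage": "IIIB",
    "biomarkers": [{"gene": "EGFR", "status": "negative"}],
    "first_line_treatment": {"regimen": "carboplatin + pemetrexed"},
}

trial_criteria = "Stage IIIB-IV NSCLC, EGFR-negative, platinum-based first-line therapy."

prompt = (
    "Given the structured patient facts below (already extracted and normalized), "
    "state whether the patient meets each trial criterion and why.\n\n"
    f"Patient facts:\n{json.dumps(structured_patient, indent=2)}\n\n"
    f"Trial criteria: {trial_criteria}"
)
# response = call_llm(prompt)   # the LLM reasons over coded facts, not raw notes
```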

    Implementation requirements: infrastructure and expertise

    Organizations implementing structured extraction at scale require technical capabilities beyond model access:

    Distributed processing infrastructure: Processing millions of documents requires Apache Spark or equivalent distributed computing frameworks. Intermountain’s Databricks Lakehouse infrastructure and Providence’s 15-worker GPU clusters demonstrate the architectural scale. Organizations need: data ingestion pipelines from EHR systems, distributed NLP processing, structured data storage (data warehouses, lakes), and integration with downstream systems (decision support, registries, analytics platforms).

    Domain-specific model libraries: Healthcare entity extraction requires models trained on clinical text recognizing medical terminology, abbreviations, and context. Healthcare NLP’s 2,800+ pre-trained models for entity extraction, relation extraction, and assertion detection provide starting points, but organizations often fine-tune for institutional terminology and documentation practices; TriNetX’s approach of handling site-level variation demonstrates this requirement.

    Terminology normalization components: Entity resolution systems mapping extracted terms to SNOMED CT, ICD-10, LOINC, RxNorm, and CPT require maintained terminology databases and fuzzy matching algorithms. These components distinguish “MI” → ICD-10: I21.9, “heart attack” → I21.9, and “STEMI” → I21.0 (a more specific code), enabling standardized coding from natural language variations.

    Quality assurance and validation workflows: Human-in-the-loop review, statistical sampling, feedback loops, and accuracy monitoring require workflows where clinical experts validate samples, corrections feed back to improve models, and performance tracking identifies drift. Ohio State’s implementation including “human-in-the-loop validation using the Generative AI Lab” demonstrates these validation patterns.

    Governance and audit frameworks: Regulatory compliance requires version control (which model/configuration processed which documents), provenance tracking (linking extractions to source text), audit logs (who accessed/modified data when), and reprocessing capabilities (when models improve or errors are identified). COTA’s emphasis on “regulatory grade” quality reflects these governance requirements.
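
    The fuzzy-matching fallback mentioned under terminology normalization components can be sketched with nothing but Python’s standard library; a production resolver would query a maintained terminology service rather than the toy index used here.

```python
# Toy entity resolution: exact synonym lookup first, fuzzy match as a fallback.
from difflib import get_close_matches

TERMINOLOGY_INDEX = {
    "myocardial infarction": "I21.9",
    "st elevation myocardial infarction": "I21.0",
    "heart attack": "I21.9",
    "type 2 diabetes mellitus with hyperglycemia": "E11.65",
}

def resolve(term: str):
    key = term.strip().lower()
    if key in TERMINOLOGY_INDEX:                        # exact synonym hit
        return key, TERMINOLOGY_INDEX[key]
    candidates = get_close_matches(key, TERMINOLOGY_INDEX.keys(), n=1, cutoff=0.75)
    if candidates:                                      # tolerate misspellings/variants
        return candidates[0], TERMINOLOGY_INDEX[candidates[0]]
    return None

print(resolve("myocardial infraction"))   # common misspelling still resolves to I21.9
```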

    Conclusion

    Generative LLMs transform how clinicians interact with medical knowledge, enabling fluent summaries, answering clinical questions, and supporting decision-making through natural conversation. But when healthcare organizations need to extract structured data from billions of clinical documents with reproducible accuracy, regulatory auditability, and terminology standardization, they choose specialized NLP pipelines optimized for precisely these requirements.

    Consider the operational deployments: MiBA’s 93% entity extraction and 88% relationship extraction across 1.4 million notes; Providence’s 0.81% PHI leak rate de-identifying 700 million documents; West Virginia’s HCC code extraction enabling both quality reporting and risk adjustment; and Intermountain’s 70% efficiency gain processing hundreds of millions of documents. These deployments validate that specialized NLP delivers precision at scale that generative models cannot yet match.

    The comparative data reinforces this: Healthcare NLP achieving 96% versus GPT-4o’s 79% F1-score on PHI detection, specialized medical NLP receiving 35% physician preference versus GPT-4o’s 24% on clinical information extraction, and 80%+ cost reduction through fixed-cost local deployment versus cloud API per-request pricing.

    The future of healthcare AI isn’t choosing between structured NLP and generative LLMs. It’s combining both. Specialized NLP provides the deterministic, auditable, terminology-normalized extraction that clinical infrastructure depends on. Generative models provide the synthesis, reasoning, and natural language interaction that makes systems accessible. Organizations building production healthcare AI deploy both, using each for tasks their architectures are optimized to solve.

    For Chief Data Officers, clinical informatics leaders, and AI strategy teams evaluating extraction approaches: the question isn’t whether your organization needs structured extraction capabilities; quality reporting, registry abstraction, clinical decision support, and real-world evidence all depend on it. The question is whether you will build these capabilities on architectures proven to deliver regulatory-grade accuracy at billion-document scale, or discover through failed pilots that fluency doesn’t equal precision.

    FAQs

    What’s the accuracy difference between specialized NLP and LLMs for clinical entity extraction?

    MiBA’s AI-enhanced oncology pipeline achieved a 93% F1-score for entity extraction and 88% for relationship extraction while processing 1.4 million physician notes. Systematic assessment showed Healthcare NLP achieving a 96% F1-score versus GPT-4o’s 79% on PHI detection, a 17-point gap. Systematic physician evaluation showed medical doctors preferring specialized healthcare NLP 35% to 24% over GPT-4o on clinical information extraction tasks. The accuracy difference reflects architectural optimization: specialized NLP models trained on clinical text with entity recognition, assertion detection, and relation extraction components achieve precision that general-purpose generative models cannot match.

    Why does deterministic processing matter for healthcare extraction workflows?

    Clinical decision support, quality reporting, and regulatory compliance require reproducible outputs. Providence Health’s processing of 700 million patient notes with 0.81% PHI leak rate depends on deterministic extraction producing identical outputs for identical inputs, enabling audit trails showing exactly which entities were detected in which documents. Generative LLMs sample from probability distributions, producing variable outputs that undermine regulatory validation. A HEDIS measure requiring exact patient counts cannot tolerate extraction variability where the same clinical note produces different entity counts across runs.

    How do specialized NLP systems normalize extracted entities to standard terminologies?

    Healthcare NLP’s entity resolution components map extracted terms to controlled vocabularies: SNOMED CT for clinical findings, ICD-10 for diagnoses, LOINC for lab tests, and RxNorm for medications. TriNetX’s approach creates “structured, harmonized labels that bring consistency across networks” despite site-level terminology variation. The systems maintain terminology databases, fuzzy matching algorithms, and synonym mappings, enabling “MI,” “myocardial infarction,” and “heart attack” all to normalize to ICD-10: I21.9 for the standardized coding that FHIR resources and interoperability require.

    What infrastructure is required for structured extraction at billion-document scale?

    Intermountain Health’s Databricks Lakehouse infrastructure processing hundreds of millions of documents demonstrates requirements: data ingestion from EHR systems, distributed NLP processing on Apache Spark, Healthcare NLP entity extraction and normalization, structured data storage (warehouses/lakes), and downstream system integration. Providence Health’s 15-worker GPU cluster completing 100,000 notes in 44 minutes validates production throughput requirements. Organizations need: distributed computing frameworks, domain-specific model libraries, terminology databases, quality assurance workflows, and governance/audit capabilities.

    Can organizations use both specialized NLP and generative LLMs together?

    Yes, hybrid architectures combine both. Intermountain’s implementation uses specialized NLP for entity extraction and structuring, then “generative AI applications for seamless querying of the database using natural language”; the extraction layer ensures precision while the generative layer provides an accessible interface. MiBA’s approach of “combining NLP and LLMs” shows the pattern: NLP extracts entities and relationships with 93%/88% F1-scores, then LLM reasoning operates over structured extractions, enabling more accurate clinical trial matching. Specialized NLP handles tasks requiring deterministic accuracy and controlled vocabularies; generative models handle synthesis, reasoning, and natural language interaction.

    What use cases absolutely require structured extraction versus generative summaries?

    Clinical decision support triggering alerts based on extracted codes, quality measure calculations requiring exact entity counts, real-world evidence studies needing normalized longitudinal data, FHIR-based interoperability depending on standardized terminologies, and regulatory reporting requiring audit trails all need structured extraction. West Virginia University’s HCC code extraction demonstrates this: extracted codes surface through best practice alerts, enabling clinical action, not just documentation. MiBA’s clinical trial matching requires precisely extracted staging and biomarkers normalized to the terminologies trial protocols specify. Generative summaries cannot trigger these structured workflows.

    How do organizations validate extraction accuracy for regulatory compliance?

    Providence Health’s approach randomly sampled 1,000 notes and used human experts to validate each (34,701 sentences total), establishing validation frameworks applicable to any extraction task. Ohio State University’s workflow includes “human-in-the-loop validation using the Generative AI Lab with full audit trails linking outputs to original documents.” COTA’s regulatory-grade approach emphasizes “uncompromising quality” through systematic validation. Organizations implement statistical sampling for accuracy monitoring, targeted human review for high-stakes extractions, feedback loops where corrections refine models, and provenance tracking from source documents through extraction to final outputs.

    What’s the cost difference between specialized NLP and cloud-based LLM APIs at scale?

    The de-identification benchmark study showed Healthcare NLP reducing processing costs by over 80% compared to Azure and GPT-4o’s per-request cloud API pricing through fixed-cost local deployment. Providence’s infrastructure processing 500,000 notes in 2.5 hours on a 15-worker cluster demonstrates the operational economics: organizations pay for computing infrastructure rather than per-token API fees. At million-document scale, per-request pricing becomes prohibitively expensive, while distributed Spark-based NLP delivers the required throughput at a fraction of the cost. For organizations processing billions of historical documents plus continuous incoming notes, this cost differential determines the economic feasibility of extraction workflows.

    Our additional expert:
    Julio Bonis is a data scientist working on Healthcare NLP at John Snow Labs. Julio has broad experience in software development and design of complex data products within the scope of Real World Evidence (RWE) and Natural Language Processing (NLP). He also has substantial clinical and management experience – including entrepreneurship and Medical Affairs. Julio is a medical doctor specialized in Family Medicine (registered GP), has an Executive MBA – IESE, an MSc in Bioinformatics, and an MSc in Epidemiology.
