The Data Governance Gap: Why Healthcare Data Platforms Fail Regulatory Compliance Audits
TL;DR: The Data Governance Gap is the distance between how most healthcare data platforms are built - pipeline-based, dataset-level governance, audit logs as an afterthought - and what modern regulatory expectations actually require: fact-level provenance, point-in-time reproducibility, parallel identified and de-identified datasets, and a tamper-evident audit trail for every access event. Closing this gap requires roughly 40 distinct capabilities built in from the first pipeline, not added later.
Why it surfaces late
The governance gap appears during a regulatory audit or FDA submission, not during development. A platform that extracts and normalizes data correctly can still fail if it cannot answer, for a single value in a single record: where did this come from, which model extracted it, how confident was the extraction, and how were conflicts resolved?
What it costs
Under the FDA's December 2025 final RWE guidance, relevance and reliability are per-fact properties of a submission. A reviewer asking 'where did this value come from?' needs a click - not a documentation retrieval project. That single requirement reshapes the architecture of every system feeding a regulatory submission.
Why it cannot be retrofitted
Provenance, confidence scoring, and conflict reconciliation are first-class attributes of every clinical fact. They must be carried through from the first parse. Adding them to a warehouse built without them requires re-architecting the pipeline from the ingestion layer up, not appending a metadata table at the end.
What dataset-level governance gets wrong
Most data warehouses govern at the level of the file or the table: access controls per dataset, versioned snapshots, dataset-level data dictionaries. That granularity works when the regulatory question is about a study population. It fails when the question is about a specific value for a specific patient.
A structured EHR field that reads "metformin 1000 mg" carries no information about whether it came from a pharmacy feed, an NLP extraction from a discharge summary, a patient-reported medication reconciliation, or a manual abstraction. When those sources disagree - and they frequently do - a dataset-level system has no record of the disagreement, no record of how it was resolved, and no way to reconstruct the decision.
The unit of governance is the fact, not the file
Dataset-level controls - access per table, versioned snapshots, dataset dictionaries - cannot answer a per-value audit question. Every clinical assertion needs its own source document, model version, confidence score, and conflict record.
Every clinical fact in a regulatory-grade system carries these attributes:
| Category | Fields |
|---|---|
| Identity | Fact ID; patient pseudonym (consistent across the de-identified dataset) |
| Clinical content | Concept code; concept system (SNOMED CT, RxNorm, LOINC, ICD-10-CM); value; effective date; assertion status (confirmed, ruled-out, family history, patient-reported, historical) |
| Provenance | Source document ID; source span (character offsets, DICOM tag path, or FHIR JSON pointer); extraction model ID and version; extraction confidence (0–1) |
| Reconciliation | Conflict set ID linking to conflicting facts from other sources; resolution method (highest confidence, most recent, human review, named rule); resolver identity |
| Versioning | Dataset version; terminology release version; creation timestamp |
Provenance, confidence, and reconciliation are first-class attributes of every clinical fact, carried by every read and propagated into every derived measure, cohort, and agent answer. Retrofitting these onto a warehouse built without them is far harder than carrying them through from the first parse. That design choice is the load-bearing one, and it has to be made before the first pipeline runs.
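The attribute table above can be sketched as a record type. This is a minimal illustration, not a reference schema: the class names, field names, and sample values (including the placeholder concept code) are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import date, datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    source_document_id: str
    source_span: str       # character offsets, a DICOM tag path, or a FHIR JSON pointer
    model_id: str
    model_version: str
    confidence: float      # extraction confidence in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


@dataclass(frozen=True)
class ClinicalFact:
    fact_id: str
    patient_pseudonym: str
    concept_system: str    # e.g. "RxNorm"
    concept_code: str
    value: str
    effective_date: date
    assertion_status: str  # confirmed, ruled-out, family history, ...
    provenance: Provenance
    dataset_version: str
    terminology_release: str
    # Reconciliation fields stay empty until a conflict is detected and resolved
    conflict_set_id: Optional[str] = None
    resolution_method: Optional[str] = None
    resolver: Optional[str] = None
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


fact = ClinicalFact(
    fact_id="fact-0001",
    patient_pseudonym="pt-7f3a",
    concept_system="RxNorm",
    concept_code="EXAMPLE-CODE",  # placeholder, not a real RxNorm code
    value="metformin 1000 mg",
    effective_date=date(2024, 3, 1),
    assertion_status="confirmed",
    provenance=Provenance("doc-42", "chars:120-137", "clinical-ner", "2.1.0", 0.93),
    dataset_version="silver-2024-03-02",
    terminology_release="hypothetical-release-2024-03",
)
```

Because provenance is a required constructor argument, a fact without a source document, span, model version, and confidence cannot be created in the first place.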
Why a monolithic pipeline does not survive audit
A monolithic pipeline that parses, extracts, normalizes, and reasons in one pass is straightforward to build. It is nearly impossible to reproduce six months later, because the extraction model version, the terminology mapping table, and the conflict-resolution rule all change. Without structural separation between layers, point-in-time reproduction requires standing up the entire stack at its earlier state, which few teams can do in practice.
The architecture that holds up under audit separates the work into three independently versioned tiers:
Bronze - lossless parsing
Every file and message format ingested without information loss: free-text notes, FHIR R4 resources, HL7 v2 messages, DICOM headers, scanned PDFs with OCR, structured warehouse extracts. Nothing dropped, nothing normalized. This tier is immutable - it records exactly what arrived. When an extraction model is retrained two years later, re-extraction runs against the original Bronze records.
Silver - extraction with provenance
Clinical facts pulled from every modality, tagged with source coordinates, scored for extraction confidence, and mapped to standard terminologies. Each fact is independently traceable to a Bronze record. Silver re-derives from Bronze without touching anything upstream.
Gold - reasoning and standardization
Duplicates merged, conflicts reconciled across documents, measures and risk scores computed, result emitted in OMOP CDM v5.4. Gold re-derives from Silver, and ultimately from Bronze. From any value in Gold, an auditor traces all the way to the raw bytes.
Each tier can be validated, re-run, and audited independently. The 21 CFR Part 11 expectation of reproducibility, first codified in 1997 for electronic records and now extended into a per-fact requirement under the new FDA RWE guidance, only holds when every layer is versioned and the dependencies between layers are explicit.
Why privacy architecture matters as much as audit trails
The most common privacy design mistake is treating de-identification as an export-time concern: store everything identified, de-identify on export. That design fails the GDPR Article 25 "data protection by default" test and the HIPAA Minimum Necessary standard for almost every research workload.
The export-time pattern fails
- Fails GDPR Article 25 "data protection by default"
- Fails HIPAA Minimum Necessary for most research workloads
- A second downstream consumer will query the identified store for a use case that should have hit the de-identified one
- De-identification errors discovered late require re-processing everything
- Export pipelines become an additional governance surface to audit
Parallel datasets from day one
- Both datasets maintained continuously from first ingestion
- Secondary-use queries route to the de-identified dataset by default
- Elevated permission required for any identified data access
- Privacy enforced by the platform schema, not export scripts
- PHI errors caught at ingestion, before downstream propagation
Build both datasets from the first ingestion, synchronized continuously, and route by default. By the time a second downstream consumer exists, the window to retrofit this correctly has closed.
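Routing by default can be sketched in a few lines. The function and field names here are hypothetical, and a real platform would enforce this in the schema and query path rather than in application code; the sketch only shows the shape of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryContext:
    user: str
    purpose: str                          # e.g. "research" or "treatment"
    identified_access_granted: bool = False


def route_dataset(ctx: QueryContext, wants_identified: bool = False) -> str:
    """Return the dataset a query may read; de-identified unless explicitly elevated."""
    if not wants_identified:
        return "deidentified"
    if not ctx.identified_access_granted:
        raise PermissionError(f"{ctx.user} lacks elevated permission for identified data")
    return "identified"


researcher = QueryContext(user="analyst-1", purpose="research")
default_target = route_dataset(researcher)
```

The identified dataset is reachable only through a path that raises by default, so a new consumer cannot end up reading it by accident.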
The 40 capabilities a regulatory-grade platform requires
The reason this architecture takes years to build is not that any single component is exotic. It is that the platform needs roughly forty distinct capabilities, all of which interact, and most of which compound badly if added late. Most fail silently or require full pipeline re-runs to add after the fact. The ones flagged in the build order section must be designed in from day one.
Multimodal ingestion (Bronze tier)
1. Free-text note parsing
Structure-preserving ingestion of clinical notes: section headers, list semantics, and table extraction retained. Nothing dropped in the parse.
2. FHIR R4/R5 ingestion
Full FHIR resource ingestion with reference traversal. Linked resources resolved at ingest time, not deferred to query time.
3. HL7 v2 message parsing
Segment and field-level access across all standard HL7 v2 message types. Encoding variants and non-standard delimiters handled.
4. PDF text extraction
Text extraction for digital PDFs and OCR-required scanned documents, with layout preservation for tables and structured reports.
5. DICOM header parsing
Header extraction across all standard SOP classes. Tag-level access preserved so source coordinates can be recorded per extracted fact.
6. DICOM pixel PHI detection
Vision-language model detection of text burned into image pixels - patient names, dates, and identifiers in scanned films and screenshots.
7. Structured source connectors
Connector framework for SQL warehouses, claims feeds, and CSV imports. Schema reverse-engineering handled once at the platform level.
8. Immutable Bronze record store
Content-addressed identifiers on every ingested record. The store is append-only - nothing is overwritten when downstream models are retrained or bugs are fixed.
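A content-addressed, append-only store like the one in capability 8 can be illustrated with an in-memory sketch. The `BronzeStore` class is an assumption for the example; a production store would sit on durable, write-once storage.

```python
import hashlib


class BronzeStore:
    """Append-only store keyed by the SHA-256 digest of the raw bytes."""

    def __init__(self) -> None:
        self._records: dict[str, bytes] = {}

    def put(self, raw: bytes) -> str:
        record_id = hashlib.sha256(raw).hexdigest()
        # Identical bytes always map to the same ID, so a repeat write is a no-op
        self._records.setdefault(record_id, raw)
        return record_id

    def get(self, record_id: str) -> bytes:
        raw = self._records[record_id]
        # Re-hash on read so silent corruption is detected rather than returned
        if hashlib.sha256(raw).hexdigest() != record_id:
            raise ValueError("stored bytes no longer match their identifier")
        return raw


store = BronzeStore()
hl7_message = b"MSH|^~\\&|LAB|HOSP|..."  # truncated sample payload
first_id = store.put(hl7_message)
second_id = store.put(hl7_message)
```

Since the identifier is derived from the content, there is no operation that changes a record while keeping its ID, which is what makes later re-extraction against the original bytes trustworthy.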
Clinical extraction (Silver tier)
Healthcare-specific models vs. general-purpose LLMs at extraction scale
The accuracy bar for clinical extraction is high enough that general-purpose LLMs underperform. A peer-reviewed evaluation published at the Text2Story Workshop (ECIR 2025) found that a healthcare-specific small model reaches 96% F1 on PHI detection, against 79% for GPT-4o, 83% for AWS Comprehend Medical, and 91% for Azure - at over 80% lower cost.
The cost economics matter for capability #14 in particular: confidence scoring runs on every fact in the corpus, so a token-billed API call per fact is non-viable at population scale. Specialized small models can handle billions of routine extraction decisions, while large generative models can be reserved for narrative generation and multi-step reasoning, where their breadth justifies their cost.
9. Healthcare NER
Named entity recognition across condition, procedure, medication, lab, and vital domains, using healthcare-specific models trained on clinical text.
10. Assertion status detection
Classifies every entity as confirmed, ruled-out, family history, patient-reported, or historical. A negative finding is never stored as a positive one.
11. Negation and uncertainty
Detects negation scope and uncertainty markers. 'No evidence of pneumonia' does not populate the pneumonia concept.
12. Temporality and date normalization
Resolves relative dates ('three weeks ago'), approximate references ('mid-2019'), and implicit dates from document context into normalized effective dates.
13. Terminology normalization
Maps all extracted concepts to SNOMED CT, RxNorm, LOINC, ICD-10-CM, and CPT. 'T2DM,' 'type 2 diabetes,' 'NIDDM,' and E11.9 resolve to the same concept.
14. Per-fact confidence scoring
Every extracted fact carries a real-valued confidence score (0–1) as a native attribute. Downstream workflows filter, flag, or route by confidence threshold.
15. Per-fact source coordinates
Every fact records its origin: document ID plus character span for text, DICOM tag path for imaging, or FHIR JSON pointer for structured resources.
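Capabilities 14 and 15 combine naturally: a fact whose recorded span does not reproduce its evidence should never be accepted, whatever its score. The sketch below assumes character-offset spans and two illustrative thresholds (0.90 and 0.60); both the names and the thresholds are assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedFact:
    document_id: str
    start: int          # character offset where the evidence begins
    end: int            # character offset just past the evidence
    text: str           # the evidence string the model reported
    confidence: float   # extraction confidence in [0, 1]


def span_matches_source(fact: ExtractedFact, documents: dict) -> bool:
    """Check that the recorded span really contains the reported evidence."""
    source = documents[fact.document_id]
    return source[fact.start:fact.end] == fact.text


def route(fact: ExtractedFact, documents: dict,
          accept_at: float = 0.90, review_at: float = 0.60) -> str:
    if not span_matches_source(fact, documents):
        return "reject"          # provenance is broken, so the score is irrelevant
    if fact.confidence >= accept_at:
        return "accept"
    if fact.confidence >= review_at:
        return "human_review"
    return "reject"


note = "Patient takes metformin 1000 mg daily."
documents = {"note-1": note}
evidence = "metformin 1000 mg"
start = note.index(evidence)
good = ExtractedFact("note-1", start, start + len(evidence), evidence, 0.97)
decision = route(good, documents)
```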
Privacy and de-identification
The parallel-dataset choice in capability #22 is the one most teams get wrong. The naive design is: store everything identified, and de-identify on export. That design fails the GDPR Article 25 "data protection by default" test, and it fails the HIPAA Minimum Necessary standard for almost every research workload. The structural design is: maintain both datasets continuously, default every secondary-use query to the de-identified one, and require explicit elevated permission for any read against the identified one. Privacy then operates as a property of the schema and the query path, enforced by the platform itself.
16. PHI in free text
Detection across all 18 HIPAA Safe Harbor identifier categories in clinical notes, reports, and other narrative text.
17. PHI in PDFs
Combined text-layer and OCR coverage for scanned and digital PDFs, handling multi-column layouts and degraded scan quality.
18. PHI in DICOM headers
Tag-level PHI detection across all standard DICOM SOP classes, including burned-in annotation text in structured report objects.
19. PHI in DICOM pixels
Vision-language model detection of identifiers burned into image pixels - the modality where header-only de-identification fails.
20. Cross-document pseudonymization
Patient pseudonyms consistent across all documents and modalities, preserving longitudinal linkage through the de-identified dataset.
21. Patient-specific date shifting
Dates shifted by a patient-specific offset that preserves intra-patient temporal relationships while preventing date-correlation re-identification.
22. Parallel identified / de-identified datasets
Both datasets maintained continuously from first ingestion, synchronized by the same pipeline. Secondary-use queries route to the de-identified dataset by default.
23. Configurable de-identification profiles
Supports HIPAA Safe Harbor, HIPAA Expert Determination, and GDPR pseudonymization. Profile selection is configurable per project without pipeline changes.
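Capabilities 20 and 21 can both be derived from a keyed hash of the patient identifier. This is one common approach, sketched under assumptions: the key handling, the pseudonym length, and the 180-day shift window are illustrative, and whether a given window satisfies a particular de-identification profile is a separate determination.

```python
import hashlib
import hmac
from datetime import date, timedelta


def pseudonym(patient_id: str, key: bytes) -> str:
    """Same patient and key always yield the same pseudonym, across all documents."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "pt-" + digest[:16]


def date_offset_days(patient_id: str, key: bytes, max_shift: int = 180) -> int:
    """Derive a stable per-patient offset in [-max_shift, +max_shift] days."""
    message = b"date-shift:" + patient_id.encode("utf-8")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    span = 2 * max_shift + 1
    return int.from_bytes(digest[:4], "big") % span - max_shift


def shift_date(d: date, patient_id: str, key: bytes) -> date:
    return d + timedelta(days=date_offset_days(patient_id, key))


key = b"demo-key-not-for-production"
admitted = shift_date(date(2024, 1, 10), "mrn-001", key)
discharged = shift_date(date(2024, 1, 24), "mrn-001", key)
length_of_stay = (discharged - admitted).days
```

Every date for a patient moves by the same offset, so intervals such as length of stay survive de-identification while the absolute dates do not.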
Reasoning and reconciliation
Reconciliation is where most platforms silently accumulate errors. Extraction is the visible engineering; reconciliation is where the platform either picks a value without explanation and propagates quiet errors, or surfaces the conflict with its evidence and routes the decision to an auditable record.
The honest design picks the second path, which requires a reasoning layer with explicit policies for every conflict shape the data produces. That layer is months of work even after the extraction is good.
24. Patient record linkage
Cross-system patient identity resolution using deterministic and probabilistic matching. Duplicate records linked before any clinical facts are extracted.
25. Cross-document deduplication
Identical clinical events documented in multiple sources are detected and merged, with all contributing source references preserved.
26. Conflict detection
When sources disagree - chart says 80 mg, pharmacy feed says 40 mg - the conflict is detected, assigned a conflict set ID, and linked to all contributing facts.
27. Temporal reasoning
Distinguishes 'history of diabetes' from 'current diabetes,' resolves treatment timelines, and assigns effective dates consistently across sources.
28. Implied inference
Inference rules applied with documented confidence (patient started metformin → likely but not confirmed diabetic). Inferred facts are flagged as inferred, not asserted.
29. Absence-as-negative inference
Absence of a finding is inferred as negative only when scoped to documents that would plausibly mention it - not asserted globally from silence.
30. Decay rules for measurements
Measurements with clinical half-lives (vital signs, lab values, eGFR) are flagged as stale after configurable intervals, preventing outdated values from driving current decisions.
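Conflict detection and resolution (capability 26) can be sketched as grouping followed by an explicit policy. The policy below, highest confidence unless the margin is small, is one hypothetical rule among the resolution methods listed earlier; the names and the 0.10 margin are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str
    patient: str
    concept: str
    value: str
    source: str
    confidence: float


def detect_conflicts(facts):
    """Group by (patient, concept); more than one distinct value is a conflict set."""
    groups = {}
    for f in facts:
        groups.setdefault((f.patient, f.concept), []).append(f)
    conflicts = {}
    for (patient, concept), members in groups.items():
        if len({m.value for m in members}) > 1:
            conflicts[f"conflict:{patient}:{concept}"] = members
    return conflicts


def resolve(members, review_margin=0.10):
    """Pick the highest-confidence value, or route to review when the margin is thin."""
    ranked = sorted(members, key=lambda m: m.confidence, reverse=True)
    top = ranked[0]
    # The strongest fact that disagrees with the top-ranked value
    rival = next(m for m in ranked if m.value != top.value)
    candidates = [m.fact_id for m in ranked]
    if top.confidence - rival.confidence < review_margin:
        return {"method": "human_review", "winner": None, "candidates": candidates}
    return {"method": "highest_confidence", "winner": top.fact_id, "candidates": candidates}


facts = [
    Fact("f1", "pt-1", "atorvastatin-dose", "80 mg", "chart", 0.95),
    Fact("f2", "pt-1", "atorvastatin-dose", "40 mg", "pharmacy-feed", 0.70),
    Fact("f3", "pt-1", "hba1c", "7.2 %", "lab", 0.99),
]
conflicts = detect_conflicts(facts)
resolution = resolve(conflicts["conflict:pt-1:atorvastatin-dose"])
```

The returned record keeps every candidate fact ID alongside the method, so the losing value and the reason it lost remain available to an auditor.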
Audit, access control, and consent
What a regulatory-grade audit log entry captures
A compliant audit log answers six questions for every data access event:
| Audit question | Fields captured |
|---|---|
| Who | User identity; user role; authentication method (SSO, service token, agent token) |
| What | Accessed tables; accessed patient pseudonyms; rows returned |
| When | Timestamp with millisecond precision |
| Why | Purpose code (e.g., research:trial_match:nct04567890); project ID; agent task ID |
| How | Access path (SQL, REST, or MCP); query text, API endpoint, or MCP tool invocation |
| Tamper-evidence | Previous event's hash; this event's hash (HMAC over the canonical event plus the prior hash) |
31. Tamper-evident audit logs
Hash-chained entries for every data access event: user identity, accessed records, timestamp with millisecond precision, purpose code, and query text.
32. Role-based access control
Access control over every data asset and every column. A clinical trial coordinator sees eligibility-relevant data without access to billing or mental health records.
33. Purpose limitation at the database layer
Queries outside a project's authorized scope are rejected at the database layer regardless of the SQL written - policy is enforced below the application level.
34. Consent enforcement at query time
Consent status is tracked per patient and enforced at query time against current status, not the status recorded at data ingestion.
35. Access pattern anomaly detection
Unusual access patterns - bulk exports, off-hours queries, cross-project record access - are detected and surfaced before they become compliance events.
The tamper-evidence row is what separates an audit log from a database table. Each entry is hashed using HMAC over its canonical fields plus the hash of the preceding entry, forming a chain. An administrator with direct database access can append new events, but cannot silently edit or delete a prior entry - any modification breaks the chain at exactly that point, and the break is detectable on inspection. This is the property that satisfies 21 CFR Part 11's requirement for audit trail integrity: the record is not just present, it is verifiably unaltered since it was written.
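The chaining described above can be sketched with the standard library. This is a minimal illustration under assumptions: canonicalization is done with sorted-key JSON, the event fields are examples, and key management is out of scope. Note that detecting removal of the most recent entries additionally requires anchoring the latest hash somewhere outside the log.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # stand-in "previous hash" for the first entry


def event_hash(event: dict, prev_hash: str, key: bytes) -> str:
    """HMAC-SHA256 over the prior hash plus a canonical encoding of the event."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    message = (prev_hash + canonical).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def append_event(log: list, event: dict, key: bytes) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev_hash,
                "hash": event_hash(event, prev_hash, key)})


def first_broken_index(log: list, key: bytes):
    """Return the index of the first entry that fails verification, or None."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        expected = event_hash(entry["event"], prev_hash, key)
        if entry["prev_hash"] != prev_hash or not hmac.compare_digest(entry["hash"], expected):
            return i
        prev_hash = entry["hash"]
    return None


key = b"demo-audit-key"
log = []
for rows in (12, 40, 7):
    append_event(log, {"user": "analyst-1", "purpose": "research:demo", "rows": rows}, key)
intact = first_broken_index(log, key)

log[1]["event"]["rows"] = 0  # simulate a silent edit by someone with database access
broken_at = first_broken_index(log, key)
```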
Versioning and reproducibility
36. Immutable dataset versioning
Immutable versioning at all three tiers. Every dataset state is content-addressed and re-derivable. Nothing is overwritten in place.
37. Terminology release versioning
SNOMED CT, RxNorm, LOINC, and ICD-10-CM releases are versioned independently. Terminology updates do not silently change the meaning of prior concept mappings.
38. Business rule versioning
Every conflict-resolution policy is versioned separately from extraction models, so policy changes are independently auditable and do not invalidate prior extraction runs.
39. Model and prompt version pinning
Extraction model versions are pinned with content-addressed artifacts. Re-running extraction two years later against the same Bronze records uses the same model unless explicitly updated.
40. Point-in-time reconstruction
Given a query and a timestamp, the system returns the result it would have returned at that time.
Capability #40 is the cleanest test of whether the rest of the work was done right. If you can ask "what would this cohort have looked like six months ago?" and get a deterministic answer in seconds, the versioning is genuine. If you cannot, somewhere upstream a version pin was missed.
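The "as of" lookup behind that test can be sketched with immutable snapshots and a binary search. The `VersionedCohort` class is hypothetical; a real platform would also pin the model, terminology, and rule versions that produced each snapshot.

```python
from bisect import bisect_right
from datetime import datetime


class VersionedCohort:
    """Append-only sequence of (effective_from, version_id, members) snapshots."""

    def __init__(self) -> None:
        self._timestamps = []
        self._snapshots = []

    def commit(self, effective_from: datetime, version_id: str, members) -> None:
        if self._timestamps and effective_from <= self._timestamps[-1]:
            raise ValueError("snapshots must be committed in increasing time order")
        self._timestamps.append(effective_from)
        self._snapshots.append((version_id, frozenset(members)))

    def as_of(self, when: datetime):
        """Return the snapshot that was current at `when`."""
        idx = bisect_right(self._timestamps, when) - 1
        if idx < 0:
            raise LookupError("no snapshot existed at that time")
        return self._snapshots[idx]


cohort = VersionedCohort()
cohort.commit(datetime(2025, 1, 1), "v1", {"pt-1", "pt-2"})
cohort.commit(datetime(2025, 7, 1), "v2", {"pt-1", "pt-2", "pt-3"})
version, members = cohort.as_of(datetime(2025, 3, 15))
```

Because earlier snapshots are never rewritten, the same query with the same timestamp returns the same answer no matter how many versions are committed afterwards.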
That is the inventory. The engineering estimate, with a senior team that has built data platforms before but not this kind of healthcare-specific platform, is two to three years to land all forty capabilities in a way that holds up under audit. With a less experienced team, longer. The capabilities are not individually exotic, but they interact, and several of them (#22 parallel datasets, #15 source coordinates, #40 point-in-time reconstruction) only work if they were planned from day one.
The build order that minimizes rework
Sequence matters because several decisions, once made incorrectly, require re-running the full pipeline to fix rather than patching in place.
Pick a shared analytic data model before the first pipeline runs
OMOP CDM v5.4 is the strongest default for secondary use: cohort definition, population analytics, real-world evidence, registry abstraction. OMOP is open, peer-reviewed, used at hundreds of institutions, and compatible with the OHDSI tool ecosystem (ATLAS, Achilles, HADES). Published RWE work is largely against OMOP, which makes reproducing prior results and contributing back tractable. FHIR R4/R5 is the strongest default for single-patient analysis: clinical decision support, point-of-care AI, patient-specific question answering, and any workload where the unit of work is one patient’s record. The two are not mutually exclusive and a platform that emits into both covers the full scope. A proprietary or homegrown schema locks every downstream analysis to one stack and is the hardest decision to reverse.
Build Bronze before extraction
The first sprint is the immutable raw layer, not the first extraction model. Every extraction model you ship will eventually be replaced. Bronze is what makes the next version re-derivable.
Make provenance a column type, not a metadata table
Source document ID, source span, extraction model version, and confidence score are attributes on every fact, not entries in a separate lineage table added later. This is the design choice that retrofits worst; once facts are stored without it, adding provenance requires re-processing the entire corpus.
Build de-identification as a pipeline stage from day one
Parallel identified and de-identified datasets from first ingestion. Once a second downstream consumer exists, the export-time approach has already failed: someone, somewhere, will query the identified store for a research use case that should have hit the de-identified one. Build both datasets from the first ingestion and route by default.
Use specialized models for routine extraction; reserve large LLMs for reasoning
The cost and determinism trade-offs favor specialization: on the PHI-detection benchmark cited above, a 96% F1 specialized model that runs deterministically on commodity hardware beats a 79% F1 frontier LLM that costs more and returns different values run-to-run. Specialized small models handle the billions of routine extraction decisions. Large generative models should be reserved for narrative generation and multi-step reasoning, where their breadth justifies their cost. Match the tool to the task at every layer.
Build the reasoning layer as an explicit stage between Silver and Gold
Capabilities #24 through #30 (record linkage, deduplication, conflict detection, temporal reasoning, inference, and decay rules) belong in their own pipeline stage, not folded into extraction or into the Gold materialization. Reconciliation has its own policies, its own audit requirements, and its own confidence outputs. Collapsing it into extraction makes the policies invisible. Collapsing it into Gold materialization makes it unauditable. It needs to be inspectable in isolation.
Version everything from the start
Versioning is the capability most commonly deferred and least commonly actually delivered, because by the time it is revisited there is too much un-pinned state to recover without re-running the full pipeline. Pin models, prompts, terminology releases, and reasoning rules from sprint one, even when the pinning feels premature.
Treat human-in-the-loop as infrastructure
Capabilities #24 through #30 surface conflicts the system cannot resolve confidently. Those need a routing layer: a review UI that shows side-by-side evidence from all contributing sources, records the human decision in the audit trail, and feeds that decision back as a training signal. NAACCR cancer registry sign-off, NCDB abstraction review, and similar regulatory workflows already require exactly this. The platform that bolts on a review screen at the end never matches the platform that designed the review queue into the pipeline.
Centralize governance for agents via MCP
Agents call high-level platform tools (search_concepts, build_cohort, get_patient_timeline), not the underlying SQL tables. Redaction, masking, and access control apply inside one boundary before any data leaves. Adding a tenth agent does not add a tenth governance surface.
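The single-boundary idea can be illustrated without any particular MCP SDK. The sketch below is plain Python, not the MCP protocol itself: the decorator, the role names, the PHI field names, and the sample row are all assumptions made for the example.

```python
AUDIT_LOG = []
TOOLS = {}
PHI_FIELDS = {"name", "mrn"}  # hypothetical field names treated as identifying


def governed_tool(name, allowed_roles):
    """Register a tool so every call is access-checked, audited, and redacted."""
    def decorator(func):
        def wrapper(caller, **kwargs):
            allowed = caller["role"] in allowed_roles
            AUDIT_LOG.append({"tool": name, "agent": caller["agent_id"],
                              "args": kwargs, "allowed": allowed})
            if not allowed:
                raise PermissionError(f"role {caller['role']!r} may not call {name}")
            rows = func(**kwargs)
            # Redaction happens inside the boundary, before anything is returned
            return [{k: v for k, v in row.items() if k not in PHI_FIELDS} for row in rows]
        TOOLS[name] = wrapper
        return wrapper
    return decorator


@governed_tool("get_patient_timeline", allowed_roles={"trial_matching_agent"})
def get_patient_timeline(patient_pseudonym):
    # Stand-in for a real query; the row is made up for illustration
    return [{"patient": patient_pseudonym, "name": "REDACT ME",
             "mrn": "000", "event": "metformin started"}]


agent = {"agent_id": "agent-1", "role": "trial_matching_agent"}
timeline = TOOLS["get_patient_timeline"](agent, patient_pseudonym="pt-1")
```

A new agent only needs a role; the access check, audit entry, and redaction come with the tool rather than with the agent.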
Continuous re-evaluation, not one-time audit
The capabilities above are not features you ship and forget. Terminologies such as SNOMED CT and RxNorm publish new releases multiple times a year, FDA guidance updates, state privacy laws change, and specialized models drift on shifting documentation patterns. The platform that re-evaluates every dataset against current policy and current models, on a schedule, stays audit-ready. The platform that audits once a year is functionally not audited.
Where this approach still has limits
Every architecture has failure modes. Naming them is what lets you monitor against them.
Extraction is not perfect
Even peer-reviewed, regulatory-grade models miss and mis-assign facts. The pipeline reduces error and makes it visible through confidence scoring; it does not eliminate it. High confidence scores narrow the review burden; they do not replace it.
Provenance proves origin, not correctness
A confidently wrong clinical note becomes a confidently traced record. The architecture makes errors findable and correctable, which is the right bar, but it does not make the errors disappear. Traceability is the precondition for correction, not a substitute for accuracy.
Specialized models require ongoing maintenance
Per-domain models for social determinants, oncology, mental health, and similar areas must be re-validated as clinical guidelines, terminology releases, and documentation patterns change. The model catalog is a recurring operational cost.
Human review is a real throughput constraint
Routing low-confidence fields to domain experts protects quality. However, expert time is finite, and a platform that does not measure and manage review throughput will silently degrade. Review queue depth is a leading indicator of data quality risk.
OMOP cannot represent everything
Some clinical nuance ('adequate organ function,' 'investigator believes the patient can comply') resists any structured model. Forcing it loses meaning; leaving it out loses completeness. The honest design surfaces what cannot be represented and routes it to human judgment, rather than hiding the gap.
Evaluation itself is hard
Gold-standard labels for clinical extraction are scarce and expensive. Accuracy claims are only as strong as the reference standard behind them, and the reference standards in healthcare are smaller and noisier than the ones in general NLP.
Naming these limits gives the auditor, the customer, and the engineering team something specific to measure against over time. Extraction accuracy is a moving target: models must be re-validated when terminology releases, clinical guidelines, or documentation patterns shift, and confidence score distributions are the early-warning signal when they start to drift. Provenance makes errors traceable and correctable, but it does not make them rare, so high-provenance systems still require ongoing human review, and review queue depth is the operational metric that predicts data quality risk before it becomes a compliance event.
OMOP's representational limits are real: some clinical nuance will always resist structured coding, and the honest response is to surface the gap explicitly rather than force a lossy mapping. Evaluation is harder than it looks in healthcare: gold-standard labels are scarce and expensive, reference datasets are smaller and noisier than in general NLP, and accuracy claims are only as strong as the reference standard behind them. A team that monitors against these failure modes (drift, review throughput, representational gaps, and evaluation quality) is in a materially better audit position than one that treats them as solved at launch.
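One way to watch confidence score distributions for drift is the population stability index (PSI), a generic distribution-shift measure. The article does not prescribe a metric, so PSI, the ten-bin layout, and the 0.25 alert threshold (a commonly quoted rule of thumb) are all assumptions here.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population stability index between two sets of scores in [0, 1]."""
    def proportions(scores):
        counts = [0] * bins
        for s in scores:
            # Clamp so a score of exactly 1.0 lands in the last bin
            counts[min(int(s * bins), bins - 1)] += 1
        total = len(scores)
        # Floor empty bins at eps to keep the logarithm defined
        return [max(c / total, eps) for c in counts]

    p = proportions(baseline)
    q = proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Made-up scores: last quarter mostly high-confidence, this week much less so
last_quarter = [0.95] * 90 + [0.55] * 10
this_week = [0.95] * 50 + [0.55] * 50
drift = psi(last_quarter, this_week)
needs_revalidation = drift > 0.25
```

A check like this can run on every extraction batch without labels, which is why it can surface a problem earlier than accuracy metrics that depend on a scarce reference standard.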
How Patient Journey Intelligence closes the data governance gap
Patient Journey Intelligence is built around fact-level provenance from the first ingestion. Every extracted clinical fact carries its source document ID, source span, extraction model version, and confidence score as native attributes on the fact record - stored alongside the clinical content, not in a separate lineage table.
Parallel identified and de-identified datasets
Both datasets are maintained continuously from the first ingestion, synchronized by the same pipeline. PHI detection has been validated at 2 billion patient notes with zero re-identifications. Secondary-use queries route to the de-identified dataset by default; identified data requires explicit elevated permission.
Tamper-evident audit logs
Every access event is recorded with user identity, accessed records, timestamp, purpose code, and query text. Hash-chained entries mean history cannot be rewritten - a property required for regulated submissions under 21 CFR Part 11.
Pinned model versions and point-in-time reconstruction
Every extraction model version is pinned. Point-in-time dataset reconstruction is a first-class query capability. Regulatory submissions can be reproduced exactly as they appeared at the time of submission.
MCP-governed agent access
All AI agents access platform capabilities through MCP endpoints with centralized access control. Every agent inherits the same redaction, masking, and audit infrastructure. Adding a new agent does not add a new governance surface.
All 40 capabilities are in production today
Organizations that committed to fact-level provenance early will satisfy the FDA's December 2025 final RWE guidance and the regulatory tightening that follows. Organizations that treated governance as documentation will find, under audit, that documentation does not substitute for architecture. Patient Journey Intelligence delivers all 40 capabilities as a production platform.
FAQ
The data governance gap is the distance between how most healthcare data platforms are built (pipeline-based, dataset-level governance, audit logs added after the fact) and what modern regulatory expectations require: fact-level provenance, point-in-time reproducibility, parallel identified and de-identified datasets, and a tamper-evident audit trail for every access event. It appears during a regulatory audit or FDA submission, not during development, which is precisely what makes it expensive to fix.
Fact-level provenance means every clinical assertion carries its source document ID, source character span, extraction model version, confidence score, detected conflicts, and resolution method as native attributes on the fact record. Under the FDA's December 2025 final RWE guidance for medical devices, relevance and reliability are per-fact properties of a submission. A reviewer asking "where did this value come from?" needs one click - sourcing it through documentation retrieval is not acceptable under that standard.
Export-time de-identification fails GDPR Article 25 ("data protection by default") and the HIPAA Minimum Necessary standard for research workloads. As soon as a second downstream consumer exists, identified data gets queried for use cases that should have hit the de-identified store. The structurally correct design maintains both datasets continuously from first ingestion, synchronized by the same pipeline, and routes all secondary-use queries to the de-identified dataset by default, with explicit elevated permission required for any read against the identified one.
A tamper-evident log uses a cryptographic hash chain: each entry is hashed over its canonical fields plus the hash of the preceding entry. An administrator with direct database access can append new records but cannot silently edit or delete a prior one: any modification breaks the chain at exactly that point, and the break is detectable on inspection. That property satisfies 21 CFR Part 11's audit trail integrity requirement: the record is not just present, it is verifiably unaltered since it was written.
Each tier is independently versioned, re-runnable, and auditable. Bronze holds immutable raw ingestion. Silver re-derives from Bronze and can be regenerated when an extraction model is retrained or a bug is fixed, without touching Bronze. Gold re-derives from Silver using explicit reconciliation policies. From any value in a Gold dataset, an auditor can trace back to the raw source bytes. Point-in-time reproducibility holds because every layer is content-addressed and the dependencies between layers are recorded, not assumed.
Accuracy and cost economics. A healthcare-specific small model reaches 96% F1 on PHI detection, against 79% for GPT-4o, 83% for AWS Comprehend Medical, and 91% for Azure - independently peer-reviewed and published at ECIR 2025. At population scale, confidence scoring must run on every extracted fact in the corpus; a token-billed API call per fact is not viable. Specialized models run deterministically at over 80% lower cost per record and produce consistent outputs, a property frontier LLMs cannot guarantee run-to-run.
With a senior engineering team experienced in data platforms but new to healthcare-specific work, two to three years to deliver all 40 capabilities in a form that holds up under audit. The capabilities are not individually exotic, but several of them (e.g. parallel identified and de-identified datasets, per-fact source coordinates, point-in-time reconstruction) interact in ways that require planning from day one. Retrofitting any of those three after the first pipeline runs requires re-processing the full corpus, not patching in place.
MCP is an open standard for AI agent interoperability. When platform capabilities are exposed as MCP tools - search_concepts, build_cohort, get_patient_timeline - rather than direct SQL table access, all redaction, masking, consent enforcement, and access control apply inside one governed boundary before any data leaves the platform. Each additional agent inherits those controls automatically. Without MCP, each new agent is a new governance surface that must be independently secured and audited.
Conflicts are detected at the Silver tier, assigned a conflict set ID, and linked to all contributing facts with their source coordinates and confidence scores. The reconciliation layer applies an explicit resolution policy - highest extraction confidence, most recent document, named business rule, or human review - and records the resolver identity and method on the fact. Every resolved conflict is fully auditable: which sources disagreed, what the policy was, and who or what made the decision.
SNOMED CT and RxNorm publish new releases multiple times a year. FDA guidance updates. State privacy laws change. Extraction models drift as clinical documentation patterns shift. A platform that re-evaluates every dataset against current terminology releases, current extraction models, and current access policies - on a defined schedule - stays audit-ready. Confidence score distributions are the early-warning signal: when a model starts drifting on a documentation pattern, the distribution shifts before accuracy metrics surface the problem. Monitoring that distribution is the operational practice that keeps a regulatory-grade platform regulatory-grade.