Regulatory-grade, fact-level data governance for healthcare AI
Most data warehouses govern at the level of the file or the table. That granularity is sufficient for many workloads. It is not sufficient for clinical AI that feeds regulatory submissions, drives care decisions, or needs to answer an auditor's question about a specific value for a specific patient.
The data governance gap describes what happens when platforms built for dataset-level governance encounter per-fact audit requirements. This page describes how Patient Journey Intelligence resolves each of those requirements - what the platform actually implements, organized around the same unit of governance the problem demands: the clinical fact.
Every clinical assertion in Patient Journey Intelligence carries six categories of attributes as native columns on the fact record, not entries in a separate lineage table added after the fact:
| Category | What Patient Journey Intelligence tracks |
|---|---|
| Identity | Fact ID; patient pseudonym (consistent across the de-identified dataset) |
| Clinical content | Concept code; concept system (SNOMED CT, RxNorm, LOINC, ICD-10-CM); value; effective date; assertion status (confirmed, ruled-out, family history, patient-reported, historical) |
| Provenance | Source document ID; source span (character offsets, DICOM tag path, or FHIR JSON pointer); extraction model ID and version; extraction confidence (0-1) |
| Reconciliation | Conflict set ID linking to conflicting facts from other sources; resolution method; resolver identity |
| Access and consent | Consent status per patient at query time; purpose code; access role; last access timestamp |
| Versioning | Dataset version; terminology release version; business rule version; creation timestamp |
The rest of this page walks through how each category is implemented across the platform's six governance layers.
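As a rough illustration of the table above, the six attribute categories can be pictured as fields on a single immutable record. This is a minimal sketch, not the platform's schema: the class name `ClinicalFact`, every field name, and the example values are assumptions made for illustration, and consent status is left off the record here because the table describes it as evaluated at query time.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Optional

ASSERTION_STATUSES = {
    "confirmed", "ruled-out", "family history", "patient-reported", "historical",
}


@dataclass(frozen=True)
class ClinicalFact:
    """Hypothetical fact record; field names are illustrative assumptions."""
    # Identity
    fact_id: str
    patient_pseudonym: str
    # Clinical content
    concept_code: str
    concept_system: str
    value: str
    effective_date: date
    assertion_status: str
    # Provenance
    source_document_id: str
    source_span: str
    extraction_model: str
    extraction_confidence: float
    # Versioning
    dataset_version: str
    terminology_release: str
    business_rule_version: str
    created_at: datetime
    # Reconciliation (empty until a conflict is detected)
    conflict_set_id: Optional[str] = None
    resolution_method: Optional[str] = None
    resolver_identity: Optional[str] = None
    # Access (consent itself is checked at query time in this sketch)
    purpose_code: Optional[str] = None
    access_role: Optional[str] = None
    last_accessed_at: Optional[datetime] = None

    def __post_init__(self) -> None:
        # Reject records that could not be defended later.
        if not 0.0 <= self.extraction_confidence <= 1.0:
            raise ValueError("extraction_confidence must be within [0, 1]")
        if self.assertion_status not in ASSERTION_STATUSES:
            raise ValueError(f"unknown assertion status: {self.assertion_status!r}")


# Placeholder values only; "EXAMPLE-CODE" is not a real terminology code.
example = ClinicalFact(
    fact_id="fact-0001", patient_pseudonym="P-0001",
    concept_code="EXAMPLE-CODE", concept_system="RxNorm", value="80 mg",
    effective_date=date(2024, 3, 1), assertion_status="confirmed",
    source_document_id="doc-0001", source_span="chars:14-32",
    extraction_model="example-extractor@1.0.0", extraction_confidence=0.93,
    dataset_version="silver-v1", terminology_release="example-release",
    business_rule_version="rules-v1",
    created_at=datetime(2024, 3, 2, tzinfo=timezone.utc),
)
```

Freezing the dataclass mirrors the idea that provenance is written once with the fact rather than patched in later.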
Bronze, Silver, and Gold: governance tiers, not just pipeline stages
The three-tier architecture is often described as a data engineering pattern. It is also a governance pattern. Each tier has one testable job, and that testability is what makes audit defensible.
Bronze: immutable raw record
Every ingested document, FHIR resource, HL7 message, DICOM header, and structured feed is stored with a content-addressed identifier in an append-only store. Nothing is dropped in the parse. Nothing is overwritten when a downstream model is retrained or a bug is fixed. Any extraction model you ship will eventually be replaced - Bronze is what makes the next version re-derivable without losing history.
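The Bronze behaviour described above, content-addressed identifiers in an append-only store, can be sketched in a few lines. The `BronzeStore` class, its in-memory dictionary, and the `sha256:` identifier prefix are assumptions for illustration; the source does not say which hash or storage backend the platform uses.

```python
import hashlib


class BronzeStore:
    """In-memory stand-in for an append-only, content-addressed store."""

    def __init__(self) -> None:
        self._blobs = {}   # content_id -> raw bytes
        self._log = []     # content ids in ingestion order

    @staticmethod
    def _content_id(raw: bytes) -> str:
        return "sha256:" + hashlib.sha256(raw).hexdigest()

    def put(self, raw: bytes) -> str:
        """Store raw bytes under their own hash; re-ingesting is a no-op."""
        content_id = self._content_id(raw)
        if content_id not in self._blobs:
            self._blobs[content_id] = raw
            self._log.append(content_id)
        return content_id

    def get(self, content_id: str) -> bytes:
        """Return the stored bytes after checking they still match their id."""
        raw = self._blobs[content_id]
        if self._content_id(raw) != content_id:
            raise ValueError("stored bytes no longer match their identifier")
        return raw

    def ingestion_log(self) -> list:
        return list(self._log)
```

The class deliberately exposes no update or delete method: a corrected or re-parsed document is new content and therefore receives a new identifier, while the earlier bytes stay addressable.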
Silver: provenance as a column type
Every extracted clinical fact is tagged at extraction time with its source document ID, source character span, extraction model version, and confidence score. These are columns on the fact record - not entries in a separate lineage table and not derivable after the fact. Storing facts without provenance is the design choice that breaks other platforms under audit: adding it later requires re-processing the full corpus.
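To make "provenance as a column" concrete, the sketch below records the source span at the moment a value is extracted and then uses it to quote the original text back. The regex extractor, the `SilverFact` fields, the model label `toy-regex@0.1`, and the fixed confidence value are all illustrative assumptions standing in for a real extraction model.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SilverFact:
    value: str
    source_document_id: str
    span_start: int            # character offsets into the source text
    span_end: int
    extraction_model: str
    extraction_confidence: float


DOSE = re.compile(r"\b(\d+)\s*mg\b")


def extract_doses(doc_id: str, text: str, model: str = "toy-regex@0.1") -> list:
    """Toy extractor: the span is captured when the match is made."""
    return [
        SilverFact(m.group(0), doc_id, m.start(), m.end(), model, 0.5)
        for m in DOSE.finditer(text)
    ]


def quote_source(documents: dict, fact: SilverFact) -> str:
    """Return the exact source characters the fact was extracted from."""
    text = documents[fact.source_document_id]
    return text[fact.span_start:fact.span_end]


documents = {"doc-0001": "Patient takes atorvastatin 80 mg daily."}
facts = extract_doses("doc-0001", documents["doc-0001"])
```

Because the offsets travel with the fact, answering "where did this value come from?" is a lookup against the stored document rather than a second extraction run.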
Gold: reconciled, versioned, OMOP-mapped
Conflicts detected at Silver are resolved on the way to Gold using explicit policies, with resolver identity and method logged. Measures are computed, and the result is emitted in OMOP CDM v5.4. Every reconciliation decision is auditable: which sources disagreed, which resolution rule applied, and who or what resolved it. The Gold dataset carries both the clinical content and the full chain of decisions that produced it.
Point-in-time reproducibility holds across all three tiers because each is independently versioned and content-addressed. An auditor can ask what the Silver dataset looked like when a specific submission was filed and get the exact state, not a reconstruction.
Reasoning and reconciliation: the layer most platforms skip
The Silver tier surfaces conflicts. The reasoning layer resolves them. This is the stage most platforms either collapse into extraction or defer to the Gold materialization. Both choices break governance.
Folding reconciliation into extraction makes the resolution policies invisible - you cannot audit a policy that was never explicitly stated. Folding it into Gold materialization makes it unauditable - the decision happened, but there is no record of how. Patient Journey Intelligence implements reconciliation as an explicit stage between Silver and Gold, with its own versioned policies and its own audit outputs.
Cross-document conflict detection
When sources disagree (for example, the chart says 80 mg and the pharmacy feed says 40 mg), the conflict is detected, assigned a conflict set ID, and linked to all contributing facts. No value is silently promoted. The disagreement is recorded before any resolution decision is made.
Auditable resolution with resolver identity
Each conflict is resolved by an explicit policy: highest extraction confidence, most recent document, a named business rule, or human review. The resolver identity and method are logged on the Gold fact. Changing the policy creates a new version. The prior version's decisions remain auditable.
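The two steps above, detecting a conflict before resolving it and then logging how it was resolved, might look like the following. This is a sketch under assumptions: the `Fact` and `Resolution` shapes, the policy names, the conflict set ID format, and the sample confidences and dates are invented for illustration, using the 80 mg versus 40 mg example from the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: str
    patient: str
    concept: str
    value: str
    source: str
    confidence: float
    document_date: str  # ISO 8601 date, so string order equals date order


@dataclass(frozen=True)
class Resolution:
    conflict_set_id: str
    winning_fact_id: str
    contributing_fact_ids: tuple
    resolution_method: str
    policy_version: str
    resolver_identity: str


def detect_conflicts(facts) -> dict:
    """Group facts by patient and concept; keep groups whose values disagree."""
    groups = defaultdict(list)
    for fact in facts:
        groups[(fact.patient, fact.concept)].append(fact)
    return {
        f"conflict:{patient}:{concept}": members
        for (patient, concept), members in groups.items()
        if len({m.value for m in members}) > 1
    }


# Explicit, named policies: nothing is resolved implicitly.
POLICIES = {
    "highest_confidence": lambda ms: max(ms, key=lambda m: m.confidence),
    "most_recent_document": lambda ms: max(ms, key=lambda m: m.document_date),
}


def resolve(conflict_set_id, members, method, policy_version, resolver) -> Resolution:
    """Apply a named policy and record who or what resolved the conflict."""
    winner = POLICIES[method](members)
    return Resolution(
        conflict_set_id=conflict_set_id,
        winning_fact_id=winner.fact_id,
        contributing_fact_ids=tuple(m.fact_id for m in members),
        resolution_method=method,
        policy_version=policy_version,
        resolver_identity=resolver,
    )


facts = [
    Fact("f1", "P-0001", "example-drug dose", "80 mg", "chart", 0.91, "2024-03-01"),
    Fact("f2", "P-0001", "example-drug dose", "40 mg", "pharmacy feed", 0.99, "2024-02-10"),
]
conflicts = detect_conflicts(facts)
```

Resolving the same conflict set under `highest_confidence` and under `most_recent_document` picks different winners here, which is why the method and policy version belong on the resolution record.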
Temporal reasoning and effective dates
The reasoning layer distinguishes 'history of diabetes' from 'current diabetes,' resolves relative dates ('three weeks ago') into normalized effective dates, and assigns clinical currency consistently across sources with different documentation conventions.
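Relative-date resolution can be illustrated with a deliberately small function. The key point is that the anchor is the date of the source document, not the date the pipeline runs. The function name, the supported phrasings (days and weeks only), and the number-word table are assumptions for this sketch; a production resolver would cover far more.

```python
import re
from datetime import date, timedelta
from typing import Optional

NUMBER_WORDS = {
    "a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}
RELATIVE = re.compile(r"\b(\d+|[a-z]+)\s+(day|week)s?\s+ago\b", re.IGNORECASE)


def resolve_relative_date(text: str, document_date: date) -> Optional[date]:
    """Turn 'three weeks ago' into a date anchored on the document's date."""
    match = RELATIVE.search(text)
    if match is None:
        return None
    quantity, unit = match.group(1).lower(), match.group(2).lower()
    count = int(quantity) if quantity.isdigit() else NUMBER_WORDS.get(quantity)
    if count is None:
        # e.g. "several weeks ago": leave unresolved rather than guess
        return None
    days = count * (7 if unit == "week" else 1)
    return document_date - timedelta(days=days)
```

For a note dated 2024-03-22, `resolve_relative_date("Symptoms began three weeks ago.", date(2024, 3, 22))` yields 2024-03-01, while a vague phrase such as "several weeks ago" returns `None` instead of an invented date.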
Inference with explicit confidence
When a fact is inferred rather than directly asserted - a patient started metformin, suggesting but not confirming a diabetes diagnosis - the inference is flagged as inferred, not asserted, and carries its own confidence score. Absence-as-negative inference is scoped only to documents that would plausibly mention the finding.
The practical consequence: when an agent or query returns a clinical value, the full path from raw source to resolved Gold fact is traceable. The question "why does this patient appear in this cohort?" has a deterministic, auditable answer.
Privacy by design: the three non-shares
Privacy architecture in Patient Journey Intelligence is defined by three things that do not happen:
Patient data never leaves your environment
All models run on-premises or in your private cloud. No PHI is transmitted to any external service, API, or model provider. Air-gapped deployments are fully supported.
Data is never shared with third-party aggregators
Patient data is not shared with EHR vendors, data aggregators, or any party outside your organization's infrastructure, regardless of existing vendor relationships.
Data is never sent to LLM providers
Medical LLMs and Healthcare NLP models run locally. No clinical text, patient identifiers, or derived facts reach any external inference endpoint.
Parallel identified and de-identified datasets
The most common design mistake in de-identification is storing everything identified and de-identifying on export. That approach fails GDPR Article 25 ("data protection by default") and the HIPAA Minimum Necessary standard for research workloads. Once a second downstream consumer exists, someone will query the identified store for a use case that should have routed to the de-identified one.
Patient Journey Intelligence maintains both datasets continuously from first ingestion, synchronized by the same pipeline. Secondary-use queries route to the de-identified dataset by default. Any read against the identified dataset requires explicit elevated permission, logged in the audit trail with the purpose code.
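The default-routing rule can be expressed as a small decision function: de-identified unless the caller explicitly asks for identified data, holds an elevated permission, and states a purpose. The `DatasetRouter` class, the `read_identified` permission name, and the audit entry shape are hypothetical and only illustrate the rule described above.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRouter:
    """Hypothetical router: de-identified by default, identified by exception."""
    audit_trail: list = field(default_factory=list)

    def route(self, user: str, permissions: set, purpose_code: str,
              identified: bool = False) -> str:
        if not identified:
            # Secondary-use queries land here without any extra configuration.
            return "deidentified"
        if "read_identified" not in permissions:
            raise PermissionError(f"{user} lacks elevated permission for identified data")
        if not purpose_code:
            raise PermissionError("identified reads require a purpose code")
        # Every identified read is logged together with its purpose.
        self.audit_trail.append(
            {"user": user, "dataset": "identified", "purpose_code": purpose_code}
        )
        return "identified"
```

The design choice worth noting is the direction of the default: forgetting to pass a flag results in the safer dataset, never the identified one.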
Validated at 2 billion patient notes
John Snow Labs' de-identification pipeline has been validated in production at Providence across 2 billion patient notes with zero re-identifications, achieving 99%+ PHI detection accuracy - independently peer-reviewed and red-teamed. Another evaluation published at ECIR 2025 measured 96% F1 against 79% for GPT-4o, 83% for AWS Comprehend Medical, and 91% for Azure, at over 80% lower cost per record.
PHI detection covers all modalities: free text in clinical notes, text layers and OCR output in PDFs, DICOM header tags, and identifiers burned into DICOM image pixels - the modality where header-only de-identification consistently fails. Pseudonyms are consistent across all documents for the same patient, preserving longitudinal linkage. Dates are shifted by a patient-specific offset that preserves intra-patient temporal relationships without enabling date-correlation re-identification across systems.
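Consistent pseudonyms and patient-specific date shifting are commonly built on a keyed hash, and the sketch below shows one such approach. It is not a description of the John Snow Labs pipeline: the HMAC-SHA256 construction, the `P-` prefix, the 365-day window, and the backward-only shift are all assumptions chosen to keep the example short.

```python
import hashlib
import hmac
from datetime import date, timedelta


def pseudonym(secret: bytes, patient_id: str) -> str:
    """Same patient and key always give the same pseudonym across documents."""
    digest = hmac.new(secret, b"pseudonym:" + patient_id.encode(), hashlib.sha256)
    return "P-" + digest.hexdigest()[:16]


def date_offset_days(secret: bytes, patient_id: str, max_days: int = 365) -> int:
    """Derive a stable per-patient offset in the range 1..max_days."""
    digest = hmac.new(secret, b"dateshift:" + patient_id.encode(), hashlib.sha256)
    return 1 + int.from_bytes(digest.digest()[:8], "big") % max_days


def shift_date(secret: bytes, patient_id: str, original: date) -> date:
    """Shift every date for one patient by the same number of days."""
    return original - timedelta(days=date_offset_days(secret, patient_id))
```

Because one patient's dates all move by the same amount, the interval between an admission and a follow-up is unchanged after shifting, while the secret key keeps the offset from being recomputed by anyone outside the pipeline.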
Figure: De-identification UI showing original and de-identified versions of DICOM, PDF, and plain-text clinical documents side by side, with associated PHI detection statistics.
Three de-identification profiles are configurable per project without pipeline changes: HIPAA Safe Harbor, HIPAA Expert Determination, and GDPR pseudonymization.
Audit and access control
Figure: Audit Logs in Patient Journey Intelligence, showing a summary dashboard with access metrics; filtering by user, purpose, and time range; and a full table of tamper-evident log entries.
What the audit log captures
Every data access event is recorded with six categories of fields. The last category links each event to the one before it, forming a hash chain that makes the log tamper-evident:
| Audit question | Fields captured |
|---|---|
| Who | User identity; user role; authentication method (SSO, service token, agent token) |
| What | Accessed tables; accessed patient pseudonyms; rows returned |
| When | Timestamp with millisecond precision |
| Why | Purpose code (e.g., research:trial_match:nct04567890); project ID; agent task ID |
| How | Access path (SQL, REST, or MCP); query text, API endpoint, or MCP tool invocation |
| Tamper-evidence | Previous event's hash; this event's hash (HMAC over the canonical event plus the prior hash) |
Each entry is hashed over its canonical fields plus the hash of the preceding entry. An administrator with direct database access can append new records but cannot silently edit or delete a prior one. Any modification breaks the chain at exactly that point, and the break is detectable on inspection. This satisfies 21 CFR Part 11's requirement for audit trail integrity: the record is not just present, it is verifiably unaltered since it was written.
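The chaining rule described above (an HMAC over the canonical event plus the prior hash) can be sketched as follows. The JSON canonicalization, the all-zero genesis hash, the entry layout, and the function names are assumptions for illustration rather than the platform's actual log format.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _entry_hash(key: bytes, event: dict, prev_hash: str) -> str:
    # Canonical form: sorted keys and fixed separators, so equal events hash equally.
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, canonical + prev_hash.encode(), hashlib.sha256).hexdigest()


def append_event(log: list, key: bytes, event: dict) -> dict:
    """Append an event whose hash covers its content and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash,
             "hash": _entry_hash(key, event, prev_hash)}
    log.append(entry)
    return entry


def first_broken_index(log: list, key: bytes):
    """Return the index where the chain first fails, or None if it is intact."""
    prev_hash = GENESIS
    for index, entry in enumerate(log):
        expected = _entry_hash(key, entry["event"], prev_hash)
        if entry["prev_hash"] != prev_hash or not hmac.compare_digest(entry["hash"], expected):
            return index
        prev_hash = entry["hash"]
    return None
```

Editing an earlier event or removing one from the middle makes `first_broken_index` report the position of the break. One limitation of this sketch: a chain on its own does not reveal removal of the newest entries, so the latest hash would also need to be recorded somewhere outside the database.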
Access control and purpose limitation
Role-based access control operates at the column level, not just the table level. A clinical trial coordinator sees eligibility-relevant data without access to billing or mental health records. A pharma researcher sees de-identified population data without access to the identified store.
Purpose limitation is enforced at the database layer, in line with the minimum necessary standard at 45 CFR 164.502(b). Queries outside a project's authorized scope are rejected regardless of the SQL written. Enforcing access policy only at the application level is not enough: enforcement below the application layer is what makes it auditable and what prevents policy drift as the number of consumers grows.
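The combination of purpose limitation and column-level access can be pictured as a single check that every query must pass before it runs. This Python function only models the decision; in the design described above the equivalent check sits in the database layer. The `ProjectScope` shape, the table and column names, and the `authorize` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectScope:
    project_id: str
    purpose_codes: frozenset
    allowed_columns: dict  # table name -> frozenset of readable columns


def authorize(scope: ProjectScope, purpose_code: str, table: str, columns: list) -> None:
    """Raise PermissionError unless the request is inside the project's scope."""
    if purpose_code not in scope.purpose_codes:
        raise PermissionError(
            f"purpose {purpose_code!r} is not authorized for {scope.project_id}"
        )
    allowed = scope.allowed_columns.get(table, frozenset())
    denied = sorted(set(columns) - allowed)
    if denied:
        raise PermissionError(f"columns not in scope for {table}: {denied}")


trial_scope = ProjectScope(
    project_id="trial-matching",
    purpose_codes=frozenset({"research:trial_match:nct04567890"}),
    allowed_columns={"conditions": frozenset({"patient_pseudonym", "concept_code"})},
)
```

A call such as `authorize(trial_scope, "research:trial_match:nct04567890", "conditions", ["concept_code"])` passes silently, while requesting a billing table or an unlisted column raises `PermissionError` no matter how the query was phrased.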
Access pattern anomaly detection surfaces bulk exports, off-hours queries, and cross-project record access before they become compliance events.
Versioning and reproducibility
The test for genuine versioning is simple: given a query and a timestamp, can the system return the result it would have returned at that time? If yes, the versioning is real. If the answer requires reconstructing a snapshot or re-running a pipeline with archived parameters, the versioning is aspirational.
Patient Journey Intelligence implements versioning across four independent layers, each of which can change without invalidating the others:
Dataset versioning across all tiers
Bronze, Silver, and Gold datasets are content-addressed and immutable. Any prior state is addressable and re-derivable. Model retraining appends a new Silver version; the prior version remains intact and queryable.
Model and prompt version pinning
Extraction models are pinned with content-addressed artifacts. Re-running extraction two years later against the same Bronze records uses the same model version unless explicitly updated. Prompt versions for LLM-based steps are pinned identically.
Terminology release versioning
Source terminologies update on different schedules: RxNorm publishes full releases monthly, LOINC typically twice a year, and ICD-10-CM at least annually, while the SNOMED CT cadence depends on the edition. Each release is versioned independently in the platform. A terminology update does not silently change the meaning of prior concept mappings - the mapping version is recorded on every Gold fact.
Business rule versioning
Every conflict-resolution policy in the reasoning layer is versioned separately from the extraction models. A policy change - updating which source wins in a medication conflict - creates a new business rule version. The change is independently auditable and does not invalidate prior extraction runs.
Point-in-time reconstruction is a first-class query capability. You can ask what a cohort would have looked like six months ago and receive a deterministic answer in seconds. That answer uses the Gold dataset as it existed at that timestamp, the extraction model versions active at that time, the terminology releases in effect, and the business rules then in force. All four layers are independently pinned, so the reconstruction is exact - not approximate.
This is the capability regulatory submissions and retrospective audits require. Versioning that cannot reconstruct a prior state on demand is not versioning for audit purposes.
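Point-in-time reconstruction across four pinned layers reduces to looking up which set of versions was in force at a given timestamp. The sketch below models that lookup with a sorted timeline; the `Manifest` fields, the `VersionTimeline` class, and the version labels used with it are assumptions, not the platform's API.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Manifest:
    """One pinned version per layer, valid from its publication time onward."""
    gold_dataset: str
    extraction_model: str
    terminology_release: str
    business_rules: str


class VersionTimeline:
    def __init__(self) -> None:
        self._times = []       # publication times, strictly increasing
        self._manifests = []   # manifest published at the matching time

    def publish(self, at: datetime, manifest: Manifest) -> None:
        if self._times and at <= self._times[-1]:
            raise ValueError("manifests must be published in increasing time order")
        self._times.append(at)
        self._manifests.append(manifest)

    def as_of(self, at: datetime) -> Manifest:
        """Return the manifest that was in force at the given timestamp."""
        index = bisect_right(self._times, at)
        if index == 0:
            raise LookupError("no manifest had been published at that time")
        return self._manifests[index - 1]
```

Changing one layer, for example a new business rule version, is recorded as a new manifest that repeats the other three pins unchanged, so `as_of` returns exactly the combination that was active at the requested time.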
How the governance layers connect
The six layers above are not independent features. They are sequential dependencies - each one requires the ones before it to function correctly.
Bronze immutability
Every ingested record is stored with a content-addressed identifier. This is the foundation - without it, re-derivability and point-in-time reconstruction are impossible.
Silver provenance
Every extracted fact carries source coordinates and confidence at extraction time. This must be designed in from the first pipeline run - retrofitting it requires re-processing the full corpus.
Privacy by design
Parallel identified and de-identified datasets from first ingestion. PHI detection across all modalities. Default routing to de-identified for all secondary use.
Reconciliation with audit trail
Conflicts resolved by versioned policies with logged resolver identity. Every Gold value is traceable to its resolution decision.
Access control
Purpose limitation at the database layer. Tamper-evident audit log for every access event.
Versioning and reproducibility
Independent versioning for datasets, models, terminology, and business rules. Point-in-time reconstruction as a first-class query capability.
Regulatory-grade data governance is not a compliance layer you add to an existing platform. It is a set of design decisions that must be made at the time of first ingestion - before the first fact is extracted, before the first conflict is resolved, before the first query touches identified data. The six layers described here represent those decisions as implemented architecture: fact-level provenance, tiered immutability, privacy by design, explicit reconciliation, tamper-evident audit, and point-in-time versioning. Together they make Patient Journey Intelligence auditable, reproducible, and defensible - for regulatory submissions, for IRB review, for legal discovery, and for the clinicians and researchers who rely on the data every day.
To learn more about AI model governance, model registry, performance monitoring, bias detection, and explainability, visit AI Governance. For more on the privacy architecture and de-identification implementation, see Privacy by Design.
FAQ
What is fact-level governance, and why are dataset-level controls not enough?
Fact-level governance means every clinical assertion carries its own provenance, confidence score, conflict record, consent status, and version information as native attributes - not entries in a separate lineage table. Dataset-level controls (access per table, versioned snapshots) cannot answer a per-value audit question: where did this specific value come from, which model extracted it, how confident was the extraction, and how were conflicts resolved? Fact-level governance can. Under the FDA's December 2025 final RWE guidance, relevance and reliability are per-fact properties of a submission. A reviewer asking "where did this value come from?" needs one click, not a documentation retrieval project.
Why must provenance be captured at extraction time rather than derived later?
Once a fact is stored without its source coordinates and model version, reconstructing that information requires re-running the extraction pipeline against the original documents with the original model - which requires that both the documents and the model are still available in the same state. In practice, models are retrained and documents are updated. Provenance stored at extraction time is exact and immutable. Provenance derived after the fact is an approximation. Regulatory audit and court-defensible evidence require the former.
Why does export-time de-identification fall short?
Export-time de-identification fails GDPR Article 25 ("data protection by default") because it stores identified data by default and de-identifies only on the way out. It fails the HIPAA Minimum Necessary standard for research workloads because it exposes identified data to any system that queries before export. As soon as a second downstream consumer exists, someone will query the identified store for a use case that should have hit the de-identified one. The structurally correct design maintains both datasets continuously from first ingestion and routes all secondary-use queries to the de-identified dataset by default.
How does the tamper-evident audit log work?
Each audit log entry is hashed using HMAC over its canonical fields plus the hash of the preceding entry, forming a chain. An administrator with direct database access can append new records but cannot silently edit or delete a prior one. Any modification breaks the chain at exactly that point, and the break is detectable on inspection. This property satisfies 21 CFR Part 11's requirement for audit trail integrity: the record is not just present, it is verifiably unaltered since it was written. The log captures who accessed data, what was accessed, when, why (purpose code), how (query text or MCP tool call), and the hash chain fields.
Why is consent enforced at query time rather than at ingestion?
Consent status tracked at ingestion reflects the patient's consent at the moment the data entered the system. If a patient later withdraws consent from research use, data ingested while they consented would still be eligible under ingestion-time enforcement - even though the patient has since withdrawn. Query-time enforcement checks the patient's current consent status at the moment of each query. A patient who withdrew consent is excluded immediately, not at the next pipeline run or the next data refresh.
Why are the versioning layers kept independent?
A terminology update - a new SNOMED CT release that adds or reorganizes concept codes - changes how concepts are mapped at the Gold tier without changing how text is extracted at the Silver tier. An extraction model retrain changes how Silver facts are produced without changing how Gold-tier business rules resolve conflicts. If these versioning layers are coupled, any change to one forces re-derivation of everything downstream. Independent versioning means a terminology update re-derives only the affected Gold mappings, leaving Silver intact. This is what makes point-in-time reconstruction tractable at population scale.
What does point-in-time reproducibility require?
Given a query and a timestamp, the system should return, within seconds, the result it would have returned at that time - without reconstructing a snapshot or re-running a pipeline with archived parameters. Patient Journey Intelligence achieves this because all four versioning layers (datasets, extraction models, terminology releases, business rules) are independently pinned and content-addressed. The reconstruction uses the Gold dataset as it existed at the target timestamp, with the extraction model versions, terminology releases, and business rules that were active at that time. All four are pinned; the answer is deterministic, not approximate.
How is purpose limitation enforced?
Every project is assigned an authorized scope: which data domains it can access, which patient populations, and which purpose codes. Queries outside that scope are rejected at the database layer regardless of what SQL is written. This is different from application-level enforcement, where the application decides what to query and the database executes any valid SQL. Database-layer enforcement means a misconfigured or compromised application cannot accidentally or maliciously access out-of-scope data. It also means enforcement is consistent across SQL queries, REST API calls, and MCP tool invocations - all three access patterns route through the same policy enforcement point.