Question 1

What does fact-level governance mean, and why does it matter?

Accepted Answer

Fact-level governance means every clinical assertion carries its own provenance, confidence score, conflict record, consent status, and version information as native attributes, not as entries in a separate lineage table. Dataset-level controls (per-table access, versioned snapshots) cannot answer a per-value audit question: where did this specific value come from, which model extracted it, how confident was the extraction, and how were conflicts resolved? Fact-level governance can. Under the FDA's December 2025 final RWE guidance, relevance and reliability are per-fact properties of a submission. A reviewer asking "where did this value come from?" needs one click, not a documentation retrieval project.
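
As a rough illustration of "native attributes", the sketch below models a fact as a single record that can answer the audit question on its own. The class name `Fact`, the helper `audit_answer`, and every field value are hypothetical and chosen for this example; they are not the product's schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Fact:
    # Identity and clinical content
    fact_id: str
    patient_pseudonym: str
    concept_system: str
    concept_code: str
    value: str
    # Provenance, stored on the fact itself rather than in a lineage table
    source_document_id: str
    source_span: Tuple[int, int]  # character offsets into the source document
    model_id: str
    model_version: str
    confidence: float  # extraction confidence in [0, 1]
    # Reconciliation, consent, and versioning
    conflict_set_id: Optional[str]
    resolution_method: Optional[str]
    consent_status: str
    dataset_version: str


def audit_answer(fact: Fact) -> dict:
    """Answer the per-value audit question from the fact's own attributes."""
    return {
        "came_from": (fact.source_document_id, fact.source_span),
        "extracted_by": f"{fact.model_id}@{fact.model_version}",
        "confidence": fact.confidence,
        "conflict_resolution": fact.resolution_method or "no conflict recorded",
    }


# All values below are made up for illustration.
example = Fact(
    fact_id="fact-001", patient_pseudonym="pseudo-42",
    concept_system="LOINC", concept_code="EXAMPLE-CODE", value="7.2",
    source_document_id="doc-17", source_span=(120, 123),
    model_id="lab-extractor", model_version="1.4.0", confidence=0.93,
    conflict_set_id=None, resolution_method=None,
    consent_status="research:granted", dataset_version="gold-2025-01",
)
print(audit_answer(example))
```

No join against a separate lineage store is needed: the answer is a projection of one row.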

Question 2

Why does provenance need to be stored at extraction time rather than derived later?

Accepted Answer

Once a fact is stored without its source coordinates and model version, reconstructing that information requires re-running the extraction pipeline against the original documents with the original model - which requires that both the documents and the model are still available in the same state. In practice, models are retrained and documents are updated. Provenance stored at extraction time is exact and immutable. Provenance derived after the fact is an approximation. Regulatory audit and court-defensible evidence require the former.
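
A minimal sketch of capturing provenance in the same step as extraction. A regular expression stands in for the extraction model, and `MODEL_ID`, `MODEL_VERSION`, and `extract_hba1c` are hypothetical names introduced only for this example.

```python
import hashlib
import re
from typing import Optional

MODEL_ID = "regex-demo"   # hypothetical stand-in for an extraction model
MODEL_VERSION = "0.1.0"   # hypothetical version label


def extract_hba1c(document_id: str, text: str) -> Optional[dict]:
    """Extract one value and record its provenance in the same step."""
    match = re.search(r"HbA1c[:\s]+(\d+(?:\.\d+)?)", text)
    if match is None:
        return None
    return {
        "value": match.group(1),
        "source_document_id": document_id,
        "source_span": match.span(1),  # character offsets of the value
        # Hash of the exact text the extractor read, so later edits are detectable
        "source_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "model_id": MODEL_ID,
        "model_version": MODEL_VERSION,
    }


original = "Labs today. HbA1c: 7.2 percent."
fact = extract_hba1c("doc-17", original)

# The document is amended later. The stored provenance still describes
# the text that was actually read at extraction time.
amended = "Labs today (corrected). HbA1c: 6.9 percent."
start, end = fact["source_span"]
print(original[start:end])
print(fact["source_sha256"] == hashlib.sha256(amended.encode("utf-8")).hexdigest())
```

Because the span, document hash, and model version are written once alongside the value, nothing has to be re-run to recover them after the document or the model changes.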

Question 3

Why does export-time de-identification fail GDPR and HIPAA requirements?

Accepted Answer

Export-time de-identification fails GDPR Article 25 ("data protection by default") because it stores identified data by default and de-identifies only on the way out. It fails the HIPAA Minimum Necessary standard for research workloads because it exposes identified data to any system that queries before export. As soon as a second downstream consumer exists, someone will query the identified store for a use case that should have hit the de-identified one. The structurally correct design maintains both datasets continuously from first ingestion and routes all secondary-use queries to the de-identified dataset by default.
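
The routing rule can be sketched as follows, assuming both datasets are written at ingestion. The allowlist `IDENTIFIED_PURPOSES`, the toy `pseudonymize` function, and the record layout are assumptions made for illustration; real de-identification is far more involved than dropping a field.

```python
from typing import Dict, List

# Hypothetical allowlist: only these purposes may read identified data.
IDENTIFIED_PURPOSES = {"treatment", "payment", "operations"}

stores: Dict[str, List[dict]] = {"identified": [], "deidentified": []}


def pseudonymize(record: dict) -> dict:
    """Toy de-identification: drop direct identifiers, keep a pseudonym."""
    return {"patient": record["pseudonym"], "note": record["note_redacted"]}


def ingest(record: dict) -> None:
    """Write both datasets at ingestion, so no export-time step exists."""
    stores["identified"].append(record)
    stores["deidentified"].append(pseudonymize(record))


def route(purpose: str) -> str:
    """Default to the de-identified dataset unless the purpose is allowlisted."""
    return "identified" if purpose in IDENTIFIED_PURPOSES else "deidentified"


def query(purpose: str) -> List[dict]:
    return stores[route(purpose)]


ingest({
    "name": "Jane Example",
    "pseudonym": "pseudo-42",
    "note": "Jane Example seen today for follow-up.",
    "note_redacted": "[PATIENT] seen today for follow-up.",
})
print(query("research:trial_match"))
```

The design choice is that the de-identified dataset is the fallback: an unknown or secondary-use purpose never reaches the identified store.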

Question 4

What makes the audit log tamper-evident?

Accepted Answer

Each audit log entry is hashed using HMAC over its canonical fields plus the hash of the preceding entry, forming a chain. An administrator with direct database access can append new records but cannot silently edit or delete a prior one. Any modification breaks the chain at exactly that point, and the break is detectable on inspection. This property satisfies 21 CFR Part 11's requirement for audit trail integrity: the record is not just present, it is verifiably unaltered since it was written. The log captures who accessed data, what was accessed, when, why (purpose code), how (query text or MCP tool call), and the hash chain fields.
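
The chaining described above can be sketched with Python's standard `hmac` and `hashlib` modules. The key, the field names, and the use of sorted-key JSON as the canonical form are assumptions for this example. The sketch detects edits and deletions of prior entries; detecting removal of the most recent entries would additionally need an external anchor for the latest hash, which is not shown.

```python
import hashlib
import hmac
import json
from typing import List, Optional

KEY = b"demo-key-not-for-production"  # hypothetical key for illustration
GENESIS = "0" * 64                    # prev_hash used by the first entry


def entry_hash(key: bytes, fields: dict, prev_hash: str) -> str:
    """HMAC-SHA256 over the canonical fields plus the previous entry's hash."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    message = (canonical + prev_hash).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def append_entry(log: List[dict], key: bytes, fields: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({
        "fields": fields,
        "prev_hash": prev_hash,
        "hash": entry_hash(key, fields, prev_hash),
    })


def first_broken_index(log: List[dict], key: bytes) -> Optional[int]:
    """Return the index where the chain first fails, or None if it is intact."""
    prev_hash = GENESIS
    for index, entry in enumerate(log):
        expected = entry_hash(key, entry["fields"], prev_hash)
        if entry["prev_hash"] != prev_hash or not hmac.compare_digest(entry["hash"], expected):
            return index
        prev_hash = entry["hash"]
    return None


log: List[dict] = []
for n in range(3):
    append_entry(log, KEY, {"who": "user-1", "what": "gold.facts", "rows_returned": n})

print(first_broken_index(log, KEY))   # intact chain
log[1]["fields"]["rows_returned"] = 0  # simulate a silent edit
print(first_broken_index(log, KEY))   # chain breaks at the edited entry
```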

Question 5

How does consent enforcement at query time differ from consent tracked at ingestion?

Accepted Answer

Consent status tracked at ingestion reflects the patient's consent at the moment the data entered the system. If a patient later withdraws consent for research use, data ingested while they still consented remains eligible under ingestion-time enforcement. Query-time enforcement checks the patient's current consent status at the moment of each query. A patient who withdrew consent is excluded immediately, not at the next pipeline run or the next data refresh.
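
The difference shows up in a few lines. In this sketch, `current_consent` is a hypothetical live registry keyed by patient pseudonym, and `consent_at_ingestion` is a hypothetical flag frozen on each row when it was loaded.

```python
from typing import Dict, List

# Current consent per patient pseudonym, updated the moment a patient changes it.
current_consent: Dict[str, bool] = {"pseudo-1": True, "pseudo-2": True}

facts: List[dict] = [
    {"patient": "pseudo-1", "value": "7.2", "consent_at_ingestion": True},
    {"patient": "pseudo-2", "value": "6.1", "consent_at_ingestion": True},
]


def query_ingestion_time(rows: List[dict]) -> List[dict]:
    """Filter on the consent flag frozen at ingestion."""
    return [row for row in rows if row["consent_at_ingestion"]]


def query_query_time(rows: List[dict], consent: Dict[str, bool]) -> List[dict]:
    """Filter on the patient's consent as it stands when the query runs."""
    return [row for row in rows if consent.get(row["patient"], False)]


current_consent["pseudo-2"] = False  # the patient withdraws consent

print([row["patient"] for row in query_ingestion_time(facts)])
print([row["patient"] for row in query_query_time(facts, current_consent)])
```

The first query still returns both patients; the second excludes the withdrawn patient without any pipeline run or data refresh.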

Question 6

Why are model versioning and terminology versioning independent?

Accepted Answer

A terminology update - a new SNOMED CT release that adds or reorganizes concept codes - changes how concepts are mapped at the Gold tier without changing how text is extracted at the Silver tier. An extraction model retrain changes how Silver facts are produced without changing how Gold-tier business rules resolve conflicts. If these versioning layers are coupled, any change to one forces re-derivation of everything downstream. Independent versioning means a terminology update re-derives only the affected Gold mappings, leaving Silver intact. This is what makes point-in-time reconstruction tractable at population scale.
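
A small sketch of why decoupling limits re-derivation. The Silver rows, the two terminology releases, and the `CODE-*` values are invented placeholders, not real SNOMED CT content.

```python
from typing import Dict, List

# Silver facts carry the extraction model version and are never rewritten here.
silver: List[dict] = [
    {"fact_id": "f1", "text": "heart attack", "model_version": "extractor-1.0"},
    {"fact_id": "f2", "text": "type 2 diabetes", "model_version": "extractor-1.0"},
]

# Two hypothetical terminology releases; only one mapping differs between them.
terminologies: Dict[str, Dict[str, str]] = {
    "release-A": {"heart attack": "CODE-100", "type 2 diabetes": "CODE-200"},
    "release-B": {"heart attack": "CODE-150", "type 2 diabetes": "CODE-200"},
}


def derive_gold(silver_facts: List[dict], release: str) -> Dict[str, dict]:
    """Map Silver text to concept codes under one pinned terminology release."""
    mapping = terminologies[release]
    return {
        fact["fact_id"]: {
            "concept_code": mapping[fact["text"]],
            "terminology_release": release,
            "model_version": fact["model_version"],
        }
        for fact in silver_facts
    }


def changed_mappings(old: Dict[str, dict], new: Dict[str, dict]) -> List[str]:
    """Fact IDs whose Gold mapping differs between two releases."""
    return sorted(
        fact_id for fact_id in new
        if new[fact_id]["concept_code"] != old[fact_id]["concept_code"]
    )


gold_a = derive_gold(silver, "release-A")
gold_b = derive_gold(silver, "release-B")
print(changed_mappings(gold_a, gold_b))
```

Only the facts whose mapping changed need new Gold rows, and the Silver rows keep their original model version throughout.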

Question 7

What is the test for genuine point-in-time reproducibility?

Accepted Answer

Given a query and a timestamp, the system should return, within seconds, the result it would have returned at that time, without reconstructing a snapshot or re-running a pipeline with archived parameters. Patient Journey Intelligence achieves this because all four versioning layers (datasets, extraction models, terminology releases, business rules) are independently pinned and content-addressed. The reconstruction uses the Gold dataset as it existed at the target timestamp, with the extraction model versions, terminology releases, and business rules that were active at that time. All four are pinned; the answer is deterministic, not approximate.
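
One way to picture the pinning is a lookup that resolves, for each layer, the version active at the target timestamp. The `HISTORY` table, its version labels, and its dates are hypothetical; the sketch shows only version resolution, not the query itself or content addressing.

```python
from bisect import bisect_right
from datetime import datetime
from typing import Dict, List, Tuple

Version = Tuple[datetime, str]  # (effective_from, version_id)

# Hypothetical version history per layer, sorted by effective_from.
HISTORY: Dict[str, List[Version]] = {
    "dataset": [(datetime(2025, 1, 1), "gold-v1"), (datetime(2025, 6, 1), "gold-v2")],
    "extraction_model": [(datetime(2025, 1, 1), "model-1"), (datetime(2025, 4, 1), "model-2")],
    "terminology": [(datetime(2025, 1, 1), "term-1"), (datetime(2025, 3, 1), "term-2")],
    "business_rules": [(datetime(2025, 1, 1), "rules-1")],
}


def active_version(history: List[Version], at: datetime) -> str:
    """Return the version whose effective_from is the latest one not after `at`."""
    starts = [start for start, _ in history]
    index = bisect_right(starts, at)
    if index == 0:
        raise LookupError("no version was active at that time")
    return history[index - 1][1]


def pins_at(at: datetime) -> Dict[str, str]:
    """Resolve all four layers independently for one target timestamp."""
    return {layer: active_version(history, at) for layer, history in HISTORY.items()}


print(pins_at(datetime(2025, 3, 15)))
```

Because each layer resolves independently, the same timestamp always yields the same set of pins, which is what makes the result deterministic.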

Question 8

How does purpose-limitation enforcement at the database layer work?

Accepted Answer

Every project is assigned an authorized scope: which data domains it can access, which patient populations, and which purpose codes. Queries outside that scope are rejected at the database layer regardless of what SQL is written. This is different from application-level enforcement, where the application decides what to query and the database executes any valid SQL. Database-layer enforcement means a misconfigured or compromised application cannot accidentally or maliciously access out-of-scope data. It also means enforcement is consistent across SQL queries, REST API calls, and MCP tool invocations - all three access patterns route through the same policy enforcement point.
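
The single enforcement point can be sketched as one function that every access path calls before any data is read. In the described design this check lives in the database layer; the Python below only illustrates the logic, and `ProjectScope`, `AccessRequest`, and `enforce` are hypothetical names.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ProjectScope:
    domains: FrozenSet[str]
    cohorts: FrozenSet[str]
    purpose_codes: FrozenSet[str]


@dataclass(frozen=True)
class AccessRequest:
    access_path: str  # "sql", "rest", or "mcp"
    domain: str
    cohort: str
    purpose_code: str


def enforce(scope: ProjectScope, request: AccessRequest) -> None:
    """Reject any request outside the project's scope, whatever the access path."""
    if request.domain not in scope.domains:
        raise PermissionError(f"domain out of scope: {request.domain}")
    if request.cohort not in scope.cohorts:
        raise PermissionError(f"cohort out of scope: {request.cohort}")
    if request.purpose_code not in scope.purpose_codes:
        raise PermissionError(f"purpose out of scope: {request.purpose_code}")


scope = ProjectScope(
    domains=frozenset({"labs", "medications"}),
    cohorts=frozenset({"oncology"}),
    purpose_codes=frozenset({"research:trial_match:nct04567890"}),
)

# The same check runs for SQL, REST, and MCP requests.
for path in ("sql", "rest", "mcp"):
    enforce(scope, AccessRequest(path, "labs", "oncology", "research:trial_match:nct04567890"))
print("in-scope requests accepted on all three paths")
```

A request for a domain outside the scope raises `PermissionError` on every path, since none of them bypasses `enforce`.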

Category	What Patient Journey Intelligence tracks
Identity	Fact ID; patient pseudonym (consistent across the de-identified dataset)
Clinical content	Concept code; concept system (SNOMED CT, RxNorm, LOINC, ICD-10-CM); value; effective date; assertion status (confirmed, ruled-out, family history, patient-reported, historical)
Provenance	Source document ID; source span (character offsets, DICOM tag path, or FHIR JSON pointer); extraction model ID and version; extraction confidence (0-1)
Reconciliation	Conflict set ID linking to conflicting facts from other sources; resolution method; resolver identity
Access and consent	Consent status per patient at query time; purpose code; access role; last access timestamp
Versioning	Dataset version; terminology release version; business rule version; creation timestamp

Audit question	Fields captured
Who	User identity; user role; authentication method (SSO, service token, agent token)
What	Accessed tables; accessed patient pseudonyms; rows returned
When	Timestamp with millisecond precision
Why	Purpose code (e.g., `research:trial_match:nct04567890`); project ID; agent task ID
How	Access path (SQL, REST, or MCP); query text, API endpoint, or MCP tool invocation
Tamper-evidence	Previous event's hash; this event's hash (HMAC over the canonical event plus the prior hash)

Regulatory-grade, fact-level data governance for healthcare AI

Bronze, Silver, and Gold: governance tiers, not just pipeline stages

Bronze: immutable raw record

Silver: provenance as a column type

Gold: reconciled, versioned, OMOP-mapped

Reasoning and reconciliation: the layer most platforms skip

Cross-document conflict detection

Auditable resolution with resolver identity

Temporal reasoning and effective dates

Inference with explicit confidence

Privacy by design: the three non-shares

Patient data never leaves your environment

Data is never shared with third-party aggregators

Data is never sent to LLM providers

Parallel identified and de-identified datasets

Validated at 2 billion patient notes

Audit and access control

What the audit log captures

Access control and purpose limitation

Versioning and reproducibility

Dataset versioning across all tiers

Model and prompt version pinning

Terminology release versioning

Business rule versioning

How the governance layers connect

Bronze immutability

Silver provenance

Privacy by design

Reconciliation with audit trail

Access control

Versioning and reproducibility

FAQ

What does fact-level governance mean, and why does it matter?

Why does provenance need to be stored at extraction time rather than derived later?

Why does export-time de-identification fail GDPR and HIPAA requirements?

What makes the audit log tamper-evident?

How does consent enforcement at query time differ from consent tracked at ingestion?

Why are model versioning and terminology versioning independent?

What is the test for genuine point-in-time reproducibility?

How does purpose-limitation enforcement at the database layer work?
