Architectural privacy by design for healthcare AI

Most healthcare software treats privacy as a compliance layer bolted on after the architecture is set: a policy document, a data use agreement, a de-identification step at export time. In Patient Journey Intelligence, privacy is structural. The platform has no mechanism to move patient data outside your environment, and every query path defaults to de-identified data - before any user, any pipeline, or any configuration asks it to.

What follows covers where your data lives, how PHI is detected and removed across every modality your organization works with, and why the design of the de-identification pipeline matters more than any individual profile setting.

Your data stays in your environment

Three guarantees apply to every Patient Journey Intelligence deployment:

No data sharing with John Snow Labs

The platform runs inside your VPC, data center, or private cloud. John Snow Labs has no telemetry channel to patient records, query results, or derived datasets.

No data sharing with third-party vendors

Patient Journey Intelligence does not depend on external analytics services, data brokers, or cloud AI APIs. There is no background pipeline moving data to any outside organization.

No data sharing with LLM providers

All language model inference runs on your infrastructure using models deployed within your environment. No patient record, clinical note, or derived fact is transmitted to OpenAI, Google, Anthropic, or any other model API.

These guarantees are architectural, not policy-based. The platform has no network paths to external services by design.

A policy guarantee says "we promise not to send your data out." An architectural guarantee says "the system has no way to send your data out." The second holds even if a setting is misconfigured, a vendor relationship changes, or a future software update is deployed without a full security review. When a researcher runs a natural language query against patient records and the platform responds using an LLM, that inference runs on a model inside your infrastructure, not on a request sent to an external API.

Keeping data inside your environment prevents external exposure. It says nothing about how that data circulates internally, who can access it, or in what form. A platform that locks all patient data inside your VPC but routes every researcher query through fully identified records has solved one problem and left the other open. The second half of the design is de-identification: once data is inside your environment, access to the identified form is the exception, not the default.

Containment defines the perimeter - patient data does not leave. De-identification defines the interior - within that perimeter, the platform structures access so that research, analytics, and AI workloads run against de-identified data by default, and access to identified records requires explicit permission. Containment without de-identification leaves every internal workload running against full PHI. De-identification without containment reduces internal overexposure but leaves an external channel open. Patient Journey Intelligence closes both gaps as part of the same architecture.

PHI detection across every modality

Healthcare data does not live in a single place or format. A patient's record spans structured database fields, free-text clinical notes, scanned documents, and medical imaging. PHI appears in all of them, in forms that differ enough to require purpose-built detection for each.

Patient Journey Intelligence provides PHI detection across all of the modalities your organization is likely to work with:

Free text. The platform detects PHI across all 18 HIPAA Safe Harbor identifier categories in clinical narratives, discharge summaries, progress notes, and any other free-text field. This includes names, dates, geographic identifiers, phone numbers, email addresses, Social Security numbers, medical record numbers, account numbers, certificate numbers, IP addresses, URLs, device identifiers, biometric identifiers, and full-face photographs. Natural language is the hardest medium to de-identify reliably because PHI appears in unpredictable positions, is subject to abbreviation and misspelling, and is often embedded in clinical context that makes naive pattern matching unreliable. The platform handles this using models trained specifically for clinical NLP.

PDF documents. PDFs present two distinct challenges that require two distinct approaches. Documents with a text layer - digitally created PDFs - require text extraction followed by NLP-based PHI detection. Scanned documents have no text layer; they require OCR to produce readable text before detection can begin. Patient Journey Intelligence combines both approaches, applying text-layer extraction where available and falling back to OCR where it is not. A referral letter from an external provider, a signed consent form, or a historical chart document all receive the same PHI detection coverage regardless of how the PDF was produced.
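The text-layer-first, OCR-fallback routing described above can be sketched as follows. This is an illustrative sketch only, not the platform's implementation: `pdf_to_text`, `fake_text_layer`, and `fake_ocr` are hypothetical names, and the two stand-in functions take the place of a real PDF text extractor and a real OCR engine.

```python
from typing import Callable, Optional, Tuple

def pdf_to_text(
    pdf_bytes: bytes,
    extract_text_layer: Callable[[bytes], Optional[str]],
    run_ocr: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Return (text, source), preferring the text layer and falling back to OCR."""
    text = extract_text_layer(pdf_bytes)
    if text and text.strip():
        return text, "text_layer"
    # No usable text layer (for example a scanned or faxed page): use OCR.
    return run_ocr(pdf_bytes), "ocr"

# Hypothetical stand-ins so the sketch runs without any PDF or OCR library.
def fake_text_layer(pdf_bytes: bytes) -> Optional[str]:
    # Pretend only inputs starting with b"digital" carry a text layer.
    return "Referral for Jane Doe" if pdf_bytes.startswith(b"digital") else None

def fake_ocr(pdf_bytes: bytes) -> str:
    return "Referral for Jane Doe (OCR)"

digital_result = pdf_to_text(b"digital-pdf", fake_text_layer, fake_ocr)
scanned_result = pdf_to_text(b"scanned-pdf", fake_text_layer, fake_ocr)
```

Whichever branch produces the text, the same downstream PHI detection step would receive it, which is what gives both kinds of PDF the same coverage.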

DICOM headers. Medical imaging files in DICOM format carry structured metadata in their headers: patient name, date of birth, study date, referring physician, institution name, and dozens of additional fields depending on the imaging modality. These headers must be detected and cleared before imaging data can be used in research or shared across systems. Patient Journey Intelligence identifies and removes PHI from DICOM header fields as part of the standard de-identification pipeline.
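As a rough illustration of header clearing, the sketch below represents a DICOM header as a plain dictionary keyed by attribute keyword. The `PHI_KEYWORDS` set and the `scrub_header` function are assumptions made for this example; a real pipeline would work on actual DICOM files, cover far more attributes, and would typically shift dates rather than blank them.

```python
# Hypothetical subset of header attributes treated as PHI in this sketch.
PHI_KEYWORDS = {
    "PatientName",
    "PatientBirthDate",
    "StudyDate",
    "ReferringPhysicianName",
    "InstitutionName",
}

def scrub_header(header: dict) -> dict:
    """Return a copy of the header with PHI attributes blanked out."""
    return {
        keyword: ("" if keyword in PHI_KEYWORDS else value)
        for keyword, value in header.items()
    }

header = {
    "PatientName": "SMITH^JOHN",
    "PatientBirthDate": "19600412",
    "StudyDate": "20230301",
    "InstitutionName": "General Hospital",
    "Modality": "US",
    "Rows": 480,
}

clean = scrub_header(header)
```

Non-identifying attributes such as `Modality` and `Rows` pass through unchanged, so the image remains usable for research.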

DICOM pixels - burned-in text. Some imaging modalities, particularly older equipment, write patient information directly into the pixel data of the image rather than only in the header. An ultrasound scan might have the patient's name and study date overlaid in the corner of the image in the same way a broadcast banner appears on a television screen. Standard DICOM header de-identification leaves this burned-in text untouched. Patient Journey Intelligence addresses this using vision-language models that detect and redact text embedded directly in image pixels, covering the modality gap that header-only de-identification misses.

Pseudonymization that works across the full record

Removing PHI field by field solves one problem and creates another. If a patient's name is replaced with a random string in one document and a different random string in another, any research that requires linking records across documents - tracking a patient's treatment trajectory, identifying readmissions, building a longitudinal cohort - becomes impossible. De-identification that destroys the utility of the data to achieve privacy cannot be called a success.

Patient Journey Intelligence handles this through consistent pseudonymization: the same patient receives the same pseudonym across every document and every modality, regardless of when those records were ingested or how many pipelines processed them. A researcher working with a de-identified cohort can still link a patient's imaging records to their clinical notes to their lab results, because the linkage identifier is consistent across all of them - while remaining entirely disconnected from any real-world identity.
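One common way to get a stable pseudonym is to derive it from a structured identifier with a keyed hash. The sketch below assumes that approach; it is not a description of how the platform generates pseudonyms, and `SECRET_KEY`, `pseudonym_for`, and the `PT-` prefix are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be managed under access controls
# separate from the research dataset.
SECRET_KEY = b"replace-with-a-managed-secret"

def pseudonym_for(mrn: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable pseudonym from a structured identifier such as an MRN."""
    digest = hmac.new(key, mrn.encode("utf-8"), hashlib.sha256).hexdigest()
    return "PT-" + digest[:12]

# Records from three modalities that share one source identifier,
# even though the name is written differently in each.
records = [
    {"modality": "note", "mrn": "MRN-001", "name": "John Smith"},
    {"modality": "dicom", "mrn": "MRN-001", "name": "SMITH^JOHN"},
    {"modality": "lab", "mrn": "MRN-001", "name": "J. Smith"},
]

# The de-identified view keeps the linkage key and drops the name entirely.
deidentified = [
    {"modality": r["modality"], "patient": pseudonym_for(r["mrn"])}
    for r in records
]
```

Because the pseudonym depends only on the identifier and the key, records ingested at different times or by different pipelines still resolve to the same value.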

Date shifting. Dates are among the most sensitive HIPAA identifiers because they enable re-identification through correlation. If a patient's exact diagnosis date is known, that date can be matched against external records - insurance claims, news coverage, public registries - to identify the individual behind an otherwise de-identified record. Patient Journey Intelligence shifts dates by a randomized offset applied consistently to each patient. The offset is patient-specific: every date in a given patient's record shifts by the same amount, so the temporal structure of their care journey is preserved. A patient whose chemotherapy started 45 days after their surgery still shows a 45-day gap in the de-identified record - but the absolute dates no longer correspond to calendar dates that could be matched against external sources.
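The interval-preserving property can be shown in a few lines. This sketch derives the per-patient offset from a keyed hash so that it is reproducible, which is a substitution for illustration; the text above only states that the offset is randomized and patient-specific. `SECRET_KEY`, `MAX_SHIFT_DAYS`, `patient_offset`, and `shift` are hypothetical names.

```python
import hashlib
import hmac
from datetime import date, timedelta

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical
MAX_SHIFT_DAYS = 365  # hypothetical bound on the offset

def patient_offset(patient_id: str, key: bytes = SECRET_KEY) -> timedelta:
    """Derive one non-zero offset per patient within +/- MAX_SHIFT_DAYS."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    n = int.from_bytes(digest[:8], "big")
    days = n % (2 * MAX_SHIFT_DAYS) - MAX_SHIFT_DAYS  # -365 .. 364
    if days >= 0:
        days += 1  # skip zero so every date actually moves
    return timedelta(days=days)

def shift(d: date, patient_id: str) -> date:
    """Apply the patient's offset to a single date."""
    return d + patient_offset(patient_id)

surgery = date(2023, 3, 1)
chemo_start = surgery + timedelta(days=45)

shifted_surgery = shift(surgery, "PT-001")
shifted_chemo = shift(chemo_start, "PT-001")
gap_days = (shifted_chemo - shifted_surgery).days
```

Both dates move by the same amount, so `gap_days` stays at 45 while neither shifted date matches the original calendar date.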

The parallel dataset architecture

How identified and de-identified data are maintained relative to each other is the design decision that determines whether de-identification holds in practice or collapses under the first real workload.

The approach most organizations default to: store everything in identified form, de-identify on export. A researcher needs de-identified data, the system de-identifies at query time, returns the result. Simple to implement, easy to explain.

It fails for two reasons. First, it fails the GDPR Article 25 "data protection by default" test: Article 25 requires that, by default, only the personal data necessary for each specific purpose is processed and made accessible. Storing everything identified and de-identifying on demand inverts this - the default state is identified access. Second, it fails the HIPAA Minimum Necessary standard for almost every research workload. If the default query path hits identified data, every researcher query operates against PHI regardless of whether that researcher needs it. The Minimum Necessary standard requires limiting access to the minimum PHI necessary for the intended purpose; routing all queries through identified data cannot satisfy that for researchers who only need de-identified records.

Patient Journey Intelligence maintains two versions of every record simultaneously: an identified dataset and a de-identified dataset, kept in sync by the same pipeline that ingests source data. Every update to the identified record is reflected in the de-identified record without a separate export step or manual re-de-identification job.

The de-identified dataset is the default for every secondary-use query. A researcher analyzing a cohort, a registry abstractor reviewing cases, a data analyst building a dashboard - each hits the de-identified dataset without any configuration, without an explicit de-identification step, and without any awareness that a separate dataset exists. Access to the identified dataset requires explicit elevated permission and generates a separate audit log entry.
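A minimal sketch of that default, assuming a simple in-memory router: callers get the de-identified record unless they explicitly ask for the identified one, hold the required permission, and accept an audit entry. `User`, `DatasetRouter`, and the `identified_access` permission name are hypothetical and do not reflect the platform's actual API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class User:
    name: str
    permissions: frozenset = frozenset()

@dataclass
class DatasetRouter:
    identified: dict
    deidentified: dict
    audit_log: list = field(default_factory=list)

    def fetch(self, user: User, record_id: str, identified: bool = False) -> dict:
        # Default path: no flag, no configuration, de-identified data.
        if not identified:
            return self.deidentified[record_id]
        # Exception path: explicit request plus elevated permission.
        if "identified_access" not in user.permissions:
            raise PermissionError(f"{user.name} lacks identified_access")
        self.audit_log.append({
            "user": user.name,
            "record_id": record_id,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return self.identified[record_id]

router = DatasetRouter(
    identified={"r1": {"patient": "John Smith", "dx": "C50.9"}},
    deidentified={"r1": {"patient": "PT-9f2c", "dx": "C50.9"}},
)
researcher = User("researcher")
privacy_officer = User("privacy_officer", frozenset({"identified_access"}))

default_view = router.fetch(researcher, "r1")
elevated_view = router.fetch(privacy_officer, "r1", identified=True)
```

The design choice worth noting is that the safe behavior needs no argument at all; identified access is the only path that requires the caller to do something extra.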

Privacy operates as a property of the schema and the query path, enforced by the platform - not as a step individual users or pipelines are expected to remember.

Configurable de-identification profiles

Different use cases have different de-identification requirements, and different regulatory frameworks specify different standards. Patient Journey Intelligence supports configurable profiles:

Profile	Standard	Use case
HIPAA Safe Harbor	18 identifier categories fully removed	Research datasets shared with external collaborators or submitted to registries
Expert Determination	Statistical standard; residual re-identification risk ≤ 0.04%	High-utility research where Safe Harbor removes too much information
GDPR pseudonymization	Personal data replaced with pseudonyms; re-identification key held separately	European data subjects; pseudonymization rather than full anonymization

A cancer registry preparing data for submission to a national registry applies Safe Harbor. A biomarker research team that needs age in years rather than age ranges might apply Expert Determination, with statistical validation of residual re-identification risk. A European clinical research program subject to GDPR applies GDPR pseudonymization, maintaining the re-identification key under access controls separate from the research dataset.

Profile selection is a compliance decision, not a technical one

The choice of de-identification profile should be made in consultation with your IRB, privacy officer, and legal counsel. The technical platform can implement any of the profiles above; which one applies to a given dataset depends on the regulatory context, the nature of the data, and the intended use.

The architecture described on this page maps directly to the technical requirements of the frameworks most healthcare organizations operate under.

Requirement	How Patient Journey Intelligence addresses it
HIPAA Privacy Rule § 164.514(b) - Safe Harbor de-identification	PHI detection and removal across all 18 identifier categories in free text, PDF, DICOM headers, and DICOM pixels
HIPAA Privacy Rule § 164.514(e) - Limited data sets	Patient-specific date shifting and geographic identifier handling
HIPAA Minimum Necessary § 164.502(b)	De-identified dataset as the default query path; identified access requires explicit elevated permission
GDPR Article 25 - Data protection by default	On-premises deployment with no external data movement; de-identified data as the structural default
GDPR Article 89 - Safeguards for research processing	Pseudonymization with re-identification key held separately; configurable GDPR pseudonymization profile

Your compliance and legal teams should verify specific control mappings against your organization's risk analysis and the data sharing agreements in scope for each dataset.

FAQ

How does the platform guarantee that LLM inference doesn't send data to external providers?

All language model inference runs on models deployed within your infrastructure. The platform architecture has no network path to external model APIs - there is no configuration switch that would route inference to OpenAI, Google, Anthropic, or any other external service. The guarantee is structural: the capability does not exist in the platform, rather than being merely disabled by policy.

What happens when new records are ingested? Do both datasets stay in sync automatically?

Yes. The ingestion pipeline maintains both the identified and de-identified datasets as outputs of the same process. When a new record is ingested or an existing record is updated, the de-identified version is produced as part of that same pipeline run. There is no separate export step, no manual re-de-identification job, and no lag between when identified data is available and when the corresponding de-identified data is available.

Can the platform handle PHI in older scanned documents where there is no text layer?

Yes. For PDF documents without a text layer - scanned paper records, faxed referrals, historical chart documents - the platform applies OCR to produce readable text, then runs PHI detection against the OCR output. The same 18-category detection applies regardless of whether the document was digitally created or scanned.

How does consistent pseudonymization work when a patient's name is spelled differently across documents?

PHI detection in clinical text operates at the entity level, not the string level. The platform identifies that a string is a patient name, links it to the correct patient record through your source identifiers, and applies the pseudonym assigned to that patient - regardless of whether the name appears as "John Smith," "J. Smith," or "Jonathan Smith" across different documents. The linkage is driven by structured identifiers (MRN, encounter ID) rather than string matching on the PHI itself.

How does this page relate to the other governance pages?

This page covers how privacy is built into the platform's deployment model and data architecture. Security and Compliance covers the technical security controls: network isolation, identity management, encryption, and audit logging. Data Governance covers fact-level provenance, versioning, and audit trails for clinical data. AI Governance covers model transparency, confidence scoring, and human-in-the-loop controls for AI outputs.

Related pages: Security and Compliance · Data Governance · AI Governance
