
Independent AI governance and model validation for healthcare AI

Every AI model in Patient Journey Intelligence - the healthcare-specific NLP models for extraction and de-identification, the large language models for clinical reasoning, and the platform's agentic workflows - is governed, tested, and continuously monitored through Pacific AI, a CHAI-certified responsible AI platform purpose-built for healthcare.

This is not self-certification. Pacific AI is an independent, CHAI-certified Assurance Resource Provider that validates AI systems against 250+ AI laws, standards, and regulations - with quarterly updates as the regulatory landscape changes. The 96% F1 de-identification accuracy that Patient Journey Intelligence publishes is measured against MedHELM, a benchmark suite developed with Stanford CRFM that covers 121 clinical tasks across 35 benchmarks. The number is reproducible and independently verifiable, not a self-reported internal metric.

What gets governed - the full model inventory

Most healthcare AI platforms govern only their NLP models. Patient Journey Intelligence governs three distinct categories of AI systems, all through the same Pacific AI lifecycle:

Healthcare-specific NLP models

The specialized small language models that perform entity extraction, de-identification, assertion status detection, and medical terminology normalization. These include the models benchmarked against GPT-4o, AWS Comprehend Medical, and Azure in the ECIR 2025 evaluation.

LLMs for reasoning and generation

The large language models used for multi-step clinical reasoning, narrative answer generation, and agent responses. These go through separate evaluation suites targeting medical cognitive bias - anchoring bias, confirmation bias, availability bias - and adversarial prompts designed to elicit HIPAA violations or ethical failures.

Agentic workflows and MCP tool calls

The platform's agentic sequences - cohort building, patient journey construction, registry population - are tested as systems, not just as individual models. Pacific AI's Gatekeeper is MCP-native, which means the governance pipeline covers tool call sequences and multi-step agent behavior, not just single-model outputs.

The practical consequence: an auditor asking "was this cohort assignment AI-governed?" gets the same answer whether the assignment came from an NLP extraction model, an LLM reasoning step, or an agentic workflow that combined both.

How verification works: Governor, Gatekeeper, Guardian

Pacific AI structures AI governance across three components that map directly onto the AI lifecycle - before deployment, at release, and in production. All three operate inside your AWS or Azure tenant.

Governor: centralized registry and risk assessment

Governor maintains the centralized inventory of every AI system deployed in Patient Journey Intelligence - model type, version, training data provenance, risk classification, vendor assessment, and the policies it operates under. This is the source of record an auditor reaches first.
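As a rough illustration of what a single inventory record holds, the sketch below models the fields listed above as a Python dataclass. The class name, field names, and example values are hypothetical; this is not Governor's actual schema or API.

```python
from dataclasses import dataclass, field, asdict


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical shape of one inventory record (not Governor's real schema)."""
    name: str
    model_type: str                 # e.g. "nlp", "llm", "agentic_workflow"
    version: str
    training_data_provenance: str
    risk_classification: str        # e.g. "low", "medium", "high"
    vendor_assessment: str
    policies: tuple[str, ...] = field(default_factory=tuple)


def missing_fields(entry: RegistryEntry) -> list[str]:
    """Return the names of fields left empty, so incomplete records are easy to spot."""
    return [key for key, value in asdict(entry).items() if not value]


# Illustrative values only; provenance is deliberately left blank.
entry = RegistryEntry(
    name="deid-clinical-notes",
    model_type="nlp",
    version="1.4.0",
    training_data_provenance="",
    risk_classification="high",
    vendor_assessment="internal",
    policies=("HTI-1",),
)
print(missing_fields(entry))
```

A completeness check like `missing_fields` is one simple way to keep a record from entering the inventory with gaps an auditor would later ask about.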

The registry is not manually maintained. Governor uses patent-pending AI to analyze system specifications against organizational policies and risk knowledge bases, automatically generating impact assessments, risk registries, and mitigating controls. Model cards are CHAI-compliant and meet ONC HTI-1 and California AB 2013 documentation standards. When a new model version ships, Governor updates the model card automatically - the documentation stays current without a manual process.

Vendor risk management included

Governor assesses every third-party AI system and capability provider through automated scoring - not just John Snow Labs' own models, but any external model or service integrated into a customer deployment. Vendor contracts and control requirements are version-tracked alongside internal model records.

Gatekeeper: pre-release CI/CD testing

Gatekeeper runs automated test suites as a CI/CD gate before any model version reaches production. No model ships without passing. The test suites cover six domains relevant to clinical AI:

Test domain | What it catches
Clinical performance | Accuracy on real-world benchmarks: clinical decision support, note generation, patient communication, administrative workflows - evaluated against MedHELM's 121 clinical tasks across 35 benchmarks
Robustness and bias | Demographic bias across patient populations; clinical data perturbation testing to detect performance fragility
Continuous red teaming | Adversarial prompts targeting ethical violations, HIPAA breaches, and jailbreak attempts
Medical cognitive biases | Anchoring bias, confirmation bias, and availability bias in clinical reasoning models
Regulatory compliance | Enforcement of HHS HTI-1, ACA Section 1557, NIST AI RMF, and applicable state AI laws
Agentic safety | MCP-native testing of tool call sequences, multi-step agent behavior, and workflow-level failure modes

Gatekeeper results are published directly to the model card in Governor, creating a continuous audit trail from test run to deployed version. An IRB reviewer or regulatory auditor can trace any production model version back to the specific test suite results that cleared it for release.
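The gating behavior described in this section can be sketched as a function that returns a non-zero exit code when any test domain fails, which is how most CI systems decide whether to block a release. The six domain identifiers mirror the test domains listed above; the result structure and function names are assumptions made for illustration, not Gatekeeper's interface.

```python
import sys

# The six test domains named in this section; keys are illustrative identifiers.
REQUIRED_DOMAINS = (
    "clinical_performance",
    "robustness_and_bias",
    "red_teaming",
    "medical_cognitive_biases",
    "regulatory_compliance",
    "agentic_safety",
)


def failed_domains(results: dict[str, bool]) -> list[str]:
    """A domain blocks release if it failed or was never run."""
    return [d for d in REQUIRED_DOMAINS if not results.get(d, False)]


def gate(results: dict[str, bool]) -> int:
    """Return a process exit code: 0 lets the release proceed, 1 blocks it."""
    blocked = failed_domains(results)
    for domain in blocked:
        print(f"BLOCKED: {domain} did not pass", file=sys.stderr)
    return 1 if blocked else 0


# Hypothetical results for one candidate model version.
example = {d: True for d in REQUIRED_DOMAINS}
example["agentic_safety"] = False
exit_code = gate(example)
print(exit_code)  # a CI wrapper script would pass this value to sys.exit()
```

Treating a missing result the same as a failure is a deliberate choice in this sketch: a domain that was skipped should not let a version through by omission.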

Guardian: continuous production monitoring

Guardian runs after deployment - continuously monitoring model performance, detecting bias, and protecting against adversarial attacks in real time. This is where the governance loop closes: a model that passed Gatekeeper on release day may drift as patient populations shift, documentation practices evolve, or adversarial inputs accumulate.

Guardian provides 360° monitoring across accuracy, performance, robustness, fairness, safety, and ethics. Drift detection runs across all six dimensions simultaneously. Configurable testing schedules balance inference cost against monitoring frequency. When Guardian flags a disparity - say, a de-identification model performing below threshold for a specific demographic - the alert flows to Governor, where it is logged against the model card and policy record. The retrained version must then pass Gatekeeper's test suites before it ships.
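One simple way to picture drift detection across several dimensions at once is to compare current scores with the scores recorded at release. The sketch below is illustrative only: the 0-1 score scale, the tolerance value, and the function name are hypothetical and do not describe Guardian's actual method.

```python
# The six monitored dimensions named in the text.
DIMENSIONS = ("accuracy", "performance", "robustness", "fairness", "safety", "ethics")


def detect_drift(baseline: dict, current: dict, tolerance: float = 0.02) -> dict:
    """Return {dimension: drop} for each dimension whose score fell by more than tolerance.

    Scores are assumed to lie on a 0-1 scale where higher is better.
    """
    drifted = {}
    for dim in DIMENSIONS:
        drop = baseline[dim] - current[dim]
        if drop > tolerance:
            drifted[dim] = round(drop, 4)
    return drifted


# Made-up scores: fairness has slipped noticeably, accuracy only slightly.
baseline = {dim: 0.95 for dim in DIMENSIONS}
current = dict(baseline, fairness=0.90, accuracy=0.94)
print(detect_drift(baseline, current))
```

With these made-up numbers only the fairness drop exceeds the tolerance, so only that dimension would raise an alert.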

The open-source evaluation standards: MedHELM and LangTest

Two open-source projects anchor the evaluation methodology - and both are stewarded by Pacific AI, making the standards publicly inspectable.

MedHELM (developed with Stanford CRFM) covers 121 distinct real-world clinical tasks across 5 categories, 22 subcategories, and 35 benchmarks. When Patient Journey Intelligence reports extraction accuracy numbers, MedHELM is the benchmark. A researcher who wants to reproduce the evaluation can run the same benchmark against the same tasks independently.
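For readers unfamiliar with the metric behind such accuracy figures, F1 is the harmonic mean of precision and recall. The sketch below computes it over exact-match entity spans. The span representation and sample data are invented for illustration; this is not MedHELM's scoring code.

```python
def f1_score(gold: set, predicted: set) -> float:
    """Exact-match F1: the harmonic mean of precision and recall."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Spans as (start_offset, end_offset, label); values are illustrative only.
gold = {(0, 8, "NAME"), (20, 30, "DATE"), (41, 52, "PHONE"), (60, 65, "ID")}
predicted = {(0, 8, "NAME"), (20, 30, "DATE"), (41, 52, "PHONE"), (70, 75, "ID")}
print(f1_score(gold, predicted))
```

Here three of four predicted spans match three of four gold spans, so precision and recall are both 0.75 and the F1 score is 0.75 as well.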

LangTest provides 100+ test types across fairness, bias, and robustness - the test library Gatekeeper draws from when building healthcare-specific test suites. Because LangTest is open source, the test definitions are auditable: you can inspect exactly what a "demographic bias" test is measuring before trusting that a model passed it.

Bias detection and fairness in production

Healthcare AI must perform equitably across every patient population it touches. A de-identification model that performs differently for one demographic group than another can pass a general accuracy benchmark and still fail in production, at the point where the disparity has already affected a real patient record.

Guardian monitors three dimensions of performance fairness continuously:

Demographic monitoring

Tracks whether models perform differently based on age, gender, race, or ethnicity. A de-identification model that masks information more aggressively for specific ethnic groups, or an extraction model with lower recall for female patients, triggers an immediate alert - not a periodic audit.

Geographic and facility analysis

Identifies regional performance variation reflecting differences in clinical documentation practices, local disease prevalence, or regional terminology. A model trained predominantly on academic medical center data may underperform at community hospitals - Guardian surfaces this before it becomes a care quality gap.

Socioeconomic monitoring

Detects whether model performance varies based on insurance type or ZIP code-based deprivation indices. Algorithmic disparities that track socioeconomic status can exacerbate existing healthcare inequity. Guardian flags them; Governor logs the remediation path.
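The per-group checks described in this section can be sketched as follows: compute recall separately for each group, then flag any group that trails the best-performing one by more than a chosen margin. The record format, the 0.05 margin, and the function names are hypothetical, chosen only to make the idea concrete.

```python
from collections import defaultdict


def recall_by_group(records) -> dict:
    """records: iterable of (group, was_detected) pairs, one per gold entity."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, detected in records:
        totals[group] += 1
        hits[group] += int(detected)
    return {group: hits[group] / totals[group] for group in totals}


def flag_disparities(recalls: dict, max_gap: float = 0.05) -> list:
    """Return groups whose recall trails the best group by more than max_gap."""
    best = max(recalls.values())
    return sorted(g for g, r in recalls.items() if best - r > max_gap)


# Made-up outcomes: 19 of 20 entities found for group_a, 17 of 20 for group_b.
records = (
    [("group_a", True)] * 19 + [("group_a", False)] * 1
    + [("group_b", True)] * 17 + [("group_b", False)] * 3
)
recalls = recall_by_group(records)
print(recalls, flag_disparities(recalls))
```

The same comparison applies whether the grouping key is a demographic attribute, a facility, or an insurance type; only the first element of each record changes.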

When Guardian detects a bias disparity, the remediation follows a governed path: Governor logs the finding against the model card, Gatekeeper gates the retrained version through bias test suites before it ships, and the full remediation cycle is auditable end to end. "We retrained the model" is not a sufficient answer to an auditor - the audit trail from detection through re-testing through redeployment is.
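The governed path amounts to an ordered sequence of recorded steps. A minimal sketch, assuming a simple in-memory log: the stage names follow the cycle this page describes (detection, logging, remediation, re-testing, redeployment), while the class and its methods are invented for illustration and do not reflect Pacific AI's data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Stage order taken from the remediation cycle described in the text.
STAGES = ("detection", "logging", "remediation", "re-testing", "redeployment")


@dataclass
class RemediationTrail:
    """Hypothetical append-only record of one remediation cycle."""
    model_version: str
    events: list = field(default_factory=list)

    def is_complete(self) -> bool:
        return len(self.events) == len(STAGES)

    def record(self, stage: str, note: str) -> None:
        """Append a timestamped event, refusing any out-of-order stage."""
        if self.is_complete():
            raise ValueError("trail is already complete")
        expected = STAGES[len(self.events)]
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        timestamp = datetime.now(timezone.utc).isoformat()
        self.events.append((stage, note, timestamp))


trail = RemediationTrail(model_version="deid-1.4.0")  # illustrative version label
for stage in STAGES:
    trail.record(stage, note=f"{stage} step recorded")
print(trail.is_complete())
```

Rejecting out-of-order stages is what separates an audit trail from a plain log: a redeployment cannot be recorded unless the re-testing step was recorded first.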

Governance data stays inside your environment

Pacific AI deploys exclusively inside your AWS or Azure tenant. Pacific AI has no access to your governance data, model cards, risk assessments, or test results. For enterprise deployments, the LLM used for "judge" inference - the model that evaluates other models' outputs - runs privately inside your VPC using Pacific AI's proprietary model, not a third-party API.

This closes a gap the current AI governance conversation often leaves open: sending model behavior data to an external governance platform creates its own privacy exposure. Patient Journey Intelligence's governance architecture avoids this entirely. The governance layer is as air-gapped as the clinical data layer.

Regulatory coverage: 250+ laws and frameworks, updated quarterly

The regulatory landscape for healthcare AI is changing faster than any single organization can track. Pacific AI covers 250+ AI laws, standards, and regulations with quarterly policy updates. The frameworks relevant to Patient Journey Intelligence deployments include:

AI governance sits between data governance - what the data says and where it came from - and security and access control - who can reach it. All three layers are required for a defensible clinical AI deployment. A well-governed model operating on unaudited data, or a perfectly audited dataset feeding an ungoverned model, leaves a gap an auditor will find.


FAQ

Pacific AI is a purpose-built responsible AI platform for healthcare, certified by CHAI as an Assurance Resource Provider. The alternative - building internal governance tooling - produces self-certification: the organization validates its own models against its own standards. Pacific AI is independent. Its test suites, evaluation benchmarks, and regulatory coverage are external to John Snow Labs and auditable by any third party. For healthcare AI that feeds regulatory submissions or clinical decisions, the independence of the governance layer matters as much as the governance layer itself.

CHAI - the Coalition for Health AI - develops assurance standards for healthcare AI and certifies organizations that provide independent validation services against those standards. Pacific AI's CHAI-certified Assurance Resource Provider status means its governance methodology, bias testing approach, and model card format have been independently reviewed against CHAI's standards. For Patient Journey Intelligence deployments, CHAI compliance means the model cards and risk assessments are recognized by health system IRBs and compliance teams as meeting an independently verified standard - not a vendor-defined one.

MedHELM is an open-source clinical AI evaluation framework developed with Stanford CRFM, covering 121 real-world clinical tasks across 35 benchmarks. When Patient Journey Intelligence reports extraction accuracy - the 96% F1 figure cited in comparisons with GPT-4o, AWS Comprehend Medical, and Azure - that number is measured on MedHELM tasks. Because MedHELM is open source, the evaluation is reproducible: a health system's data science team can run the same benchmark against the same task definitions independently and verify the number. Self-reported accuracy against proprietary internal benchmarks cannot be verified this way.

Both. Gatekeeper is MCP-native, which means it tests tool call sequences and multi-step agent behavior directly - not just individual model inputs and outputs. A cohort-building workflow that combines an NLP extraction step, a reasoning step, and a registry population step is tested as a system. Failures that only appear in multi-step sequences - where the output of one model becomes input to the next - are caught before the workflow ships. This is the governance gap most platforms leave open: individual models pass their tests, but the composed workflow was never tested end to end.

Guardian detects the drift - a drop in accuracy, a demographic performance disparity, an anomalous pattern in adversarial inputs - and logs it against the model's record in Governor. The finding creates a policy action item: the model must be retrained or recalibrated and the new version must pass Gatekeeper's test suites before it ships. The full cycle - detection, logging, remediation, re-testing, redeployment - is auditable. "We retrained the model after Guardian flagged the drift on this date, and the retrained version passed these specific Gatekeeper test suites before deploying" is the audit-defensible answer. "We noticed the model wasn't working well and updated it" is not.

Yes. Governor maintains audit trails with role-based access, so IRB reviewers, compliance officers, and risk managers can access model cards, impact assessments, test results, and policy records directly - without going through the clinical data layer. The governance data is separated from patient data by design. Because Pacific AI deploys inside your tenant, the governance records are under your organization's data governance policies, not Pacific AI's.