In today’s data-rich healthcare landscape, oncology remains among the most complex and information-dense domains. Electronic health records (EHRs), pathology reports, radiology narratives, and clinical trial documents contain vital insights about cancer diagnoses, staging, biomarkers, treatment plans, and outcomes—yet much of this information lives as unstructured text. To convert free-text narratives into actionable data, oncology teams need tools tailored to the domain’s unique vocabulary, patterns, and regulatory requirements. John Snow Labs’ Medical Language Models offer a comprehensive suite of components—named entity recognition, relation extraction, negation and assertion detection, ontology mapping, and more—optimized for oncology. In this post, we’ll explore the challenges of oncology information extraction (IE), review core Medical Language Models capabilities, compare specialized Medical Language Models against relying solely on a general-purpose LLM, and demonstrate how to build an end-to-end pipeline that transforms unstructured oncology notes into structured, analysis-ready datasets.
Why Oncology Information Extraction Is Challenging
Information extraction in oncology is particularly challenging for several reasons:
- Complex, Specialized Vocabulary: Cancer-related terminology spans thousands of tumor types (e.g., “invasive ductal carcinoma,” “stage IIIB non-small cell lung carcinoma”), molecular biomarkers (e.g., “HER2-positive,” “EGFR exon 19 deletion”), and treatment regimens (e.g., “FOLFOX,” “R-CHOP”). Acronyms proliferate (“CR,” “PR,” “PD”), and lab values (e.g., “CA-125: 45 U/mL”) require interpretation in context.
- Unstructured, Heterogeneous Source Texts: Pathology reports, radiology impressions, and oncologist progress notes follow different templates. Abbreviations like “pt” for patient or “SOB” for shortness of breath vary by clinician.
- Critical Negation & Uncertainty: Phrases like “No evidence of metastasis to lymph nodes” or “Patient denies chest pain” must be flagged as negative findings. Uncertainty qualifiers (“possible recurrence,” “likely benign lesion”) require careful handling.
- Co-referencing & Temporal Context: Reports reference prior studies or treatments (“tumor measured 3 cm on CT in January 2023”). Linking pronouns (“the lesion”) back to the correct context is essential.
- High Regulatory & Quality Standards: Oncology data feed into decision support, trial eligibility, and registries. Errors can lead to misclassification of stage or missed adverse events. Data must be auditable and highly accurate.
Overview of John Snow Labs’ Medical Language Models for Oncology
- Oncology-Specific Named Entity Recognition (NER): Identify tumor types, staging, biomarkers, therapies, and prognosis phrases. Complement with general NER for medications and procedures.
- Assertion & Negation Detection: Label entities as present, negated, conditional, or historical.
- Relation Extraction & Normalization: Link entities (e.g., Tumor → Stage, Biomarker → Value) and map to codes (ICD-O, SNOMED CT).
- Section & Document Segmentation: Detect headings like “Diagnosis” or “Immunohistochemistry” and split reports into labeled blocks.
- Terminology & Ontology Integration: Integrate with SNOMED CT, RXNORM, UMLS, and NCI Thesaurus for standardized coding.
- Customizability & Extendibility: Fine-tune models on local data, add new tumor subtypes or proprietary drug names.
- Integration of John Snow Labs’ Medical LLM for Custom Field Extraction: Invoke a Medical LLM to extract novel oncology fields (e.g., new biomarkers) via prompting when no pre-trained model exists.
- LLM-Assisted Reasoning & Inference: Use the Medical LLM to infer implicit information (e.g., performance status) by reasoning over narratives.
Core Components for an Oncology IE Pipeline
1. Document Import & Segmentation
• PDF & Image Support (OCR): If source notes are scans (e.g., scanned pathology PDFs), run OCR first to extract text.
• Section Splitter: Detect headings like “Diagnosis,” “Immunohistochemistry,” or “Treatment Plan” and split each report into labeled sections.
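Before reaching for a dedicated annotator, a lightweight splitter is easy to prototype. Below is a minimal sketch in plain Python, assuming a fixed, hypothetical set of heading patterns:

```python
import re

# Hypothetical heading set; extend to match your institution's report templates
HEADING = r"(Diagnosis|Immunohistochemistry|Treatment Plan|Impression)"

def split_sections(report_text: str) -> dict:
    """Split a report into {heading: body} blocks on recognized heading lines."""
    parts = re.split(rf"^\s*{HEADING}\s*:?\s*$", report_text,
                     flags=re.MULTILINE | re.IGNORECASE)
    # re.split with a capturing group alternates [preamble, heading, body, ...]
    return {head.title(): body.strip()
            for head, body in zip(parts[1::2], parts[2::2])}
```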
2. Text Preprocessing
• Tokenizer: A clinically tuned tokenizer that preserves medically relevant constructs (e.g., keeping “T2N0M0” as a single token).
• Sentence Detector: A model that understands medical abbreviations (e.g., “C.O.P.D.”) so sentences aren’t split inappropriately.
• Part-of-Speech Tagging (optional): Helpful if you plan to derive grammatical relations or co-reference resolution.
3. Named Entity Recognition (NER)
• Oncology NER Model: A multi-class, deep-learning NER model that returns spans labeled as TUMOR_TYPE, STAGE, BIOMARKER, THERAPY, PROGNOSIS, etc.
• General Clinical NER Model: Complement with MEDICATION, PROCEDURE, ANATOMICAL_SITE, and LAB_TEST labels to capture ancillary data.
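Steps 2 and 3 chain together in a few lines. Here is a minimal sketch using the Spark NLP for Healthcare Python API; the pretrained model names (sentence_detector_dl_healthcare, embeddings_clinical, ner_oncology) are illustrative and should be verified against the current John Snow Labs Models Hub:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, SentenceDetectorDLModel, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# Healthcare-tuned sentence boundaries (avoids splitting on "C.O.P.D.", etc.)
sentences = SentenceDetectorDLModel.pretrained(
    "sentence_detector_dl_healthcare", "en", "clinical/models"
).setInputCols(["document"]).setOutputCol("sentence")

tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained(
    "embeddings_clinical", "en", "clinical/models"
).setInputCols(["sentence", "token"]).setOutputCol("embeddings")

ner = MedicalNerModel.pretrained(
    "ner_oncology", "en", "clinical/models"
).setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")

# Merge IOB tags into entity chunks (e.g., "stage IIIB" as one STAGE span)
chunks = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk")

ner_pipeline = Pipeline(stages=[document, sentences, tokens,
                                embeddings, ner, chunks])
```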
4. Assertion & Negation Classification
• Assertion Detection: Annotate each entity span as Present, Negated, Historical, or Conditional. This ensures that a mention like “ER positivity” in the phrase “No ER positivity detected” is correctly labeled as negated (i.e., ER-negative).
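Continuing the sketch above, assertion detection plugs in right after the NER converter (assertion_dl is again an illustrative pretrained-model name):

```python
from sparknlp_jsl.annotator import AssertionDLModel

# Labels each ner_chunk with a status such as present, absent/negated,
# conditional, or hypothetical (the exact label set depends on the model)
assertion = AssertionDLModel.pretrained(
    "assertion_dl", "en", "clinical/models"
).setInputCols(["sentence", "ner_chunk", "embeddings"]).setOutputCol("assertion")

full_pipeline = Pipeline(stages=[document, sentences, tokens, embeddings,
                                 ner, chunks, assertion])
```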
5. Entity Normalization
• NER Normalizer: Map raw cancer mentions to standardized codes (e.g., “Stage IIIB NSCLC” → SNOMED CT: 26294004).
• Gazetteer Lookup: Supplement NER with custom lists (e.g., local trial identifiers, proprietary drug names).
6. Relation Extraction (RE)
• Relation Detection: Identify relationships such as (TUMOR_TYPE, “has_stage”, STAGE), (BIOMARKER, “has_value”, VALUE), or (DRUG_NAME, “has_dose”, DOSE).
• Outputs structured triples that link entities, for example:
{ "entity1": "Breast lobular carcinoma", "relation": "has_stage", "entity2": "Stage IIIC" }
7. Post-Processing & Aggregation
• Custom Rules/UDFs: Consolidate all NER + RE outputs into a flattened JSON or table schema (e.g., columns: patient_id, tumor_type, stage, ER_status, HER2_status, treatment_plan); see the sketch after this list.
• De-duplication & Co-reference Resolution: If “the tumor” appears later without repeating the full name, group mentions under the same concept for each patient.
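As a framework-agnostic sketch of this consolidation step (the input dictionaries are assumed to be simplified NER and RE outputs from upstream; field names follow the example schema above):

```python
def aggregate_patient_record(patient_id: str,
                             entities: list[dict],
                             relations: list[dict]) -> dict:
    """Flatten NER + RE output into one row per patient (illustrative schema)."""
    row = {"patient_id": patient_id, "tumor_type": None, "stage": None,
           "ER_status": None, "HER2_status": None, "treatment_plan": None}
    for ent in entities:
        # Drop negated or historical mentions flagged by assertion detection
        if ent.get("assertion") in ("Negated", "Historical"):
            continue
        if ent["label"] == "TUMOR_TYPE":
            row["tumor_type"] = ent["text"]
        elif ent["label"] == "THERAPY":
            row["treatment_plan"] = ent["text"]
    for rel in relations:
        if rel["relation"] == "has_stage":
            row["stage"] = rel["entity2"]
    return row
```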
Real-World Use Cases in Oncology Information Extraction
1. Automated Tumor Registry Population
Context: A cancer center must populate a tumor registry with structured data—primary site, histology, grade, stage, biomarker status—from pathology and clinical notes.
Pipeline Highlights:
– OCR + Oncology NER on scanned pathology PDFs to extract “invasive ductal carcinoma,” normalized to ICD-O codes.
– NER + Assertion Detection on clinical notes to capture “ER-positive, PR-negative, HER2-negative.”
– Relation Extraction links each receptor status to the correct tumor mention.
– The combined structured output is ingested directly into the central registry schema, reducing manual chart review by over 70%.
2. Identifying Clinical Trial Cohorts
Context: An academic hospital seeks patients with “KRAS G12C mutation–positive non-small cell lung carcinoma (NSCLC)” to enroll in a targeted therapy trial.
Pipeline Highlights:
– Molecular Pathology Reports (OCR if needed) → Oncology NER identifies (GENE_MUTATION = “KRAS G12C”) and (TUMOR_TYPE = “NSCLC”).
– Assertion Classification confirms that “mutation detected” is affirmative.
– Relation Extraction links the gene mutation mention to “NSCLC.”
– A downstream query filters for patients with gene_mutation = “KRAS G12C” AND tumor_type = “NSCLC” AND stage IN (“IV”, “III”).
– A nightly roster of eligible patients is generated for the trial team instead of manual review.
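In pandas, that roster query might look like this (a sketch; column names follow the flattened schema used earlier, and the toy rows stand in for real pipeline output):

```python
import pandas as pd

# Toy flattened table standing in for the aggregation step's output
df = pd.DataFrame([
    {"patient_id": "P1", "gene_mutation": "KRAS G12C", "tumor_type": "NSCLC", "stage": "IV"},
    {"patient_id": "P2", "gene_mutation": "EGFR L858R", "tumor_type": "NSCLC", "stage": "III"},
])

eligible = df[(df["gene_mutation"] == "KRAS G12C")
              & (df["tumor_type"] == "NSCLC")
              & (df["stage"].isin(["III", "IV"]))]
print(eligible["patient_id"].tolist())  # -> ['P1']
```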
3. Treatment Quality Monitoring
Context: The oncology quality improvement (QI) team wants to flag instances where recommended treatment deviates from guidelines—in particular, “Stage III colon cancer” should receive adjuvant FOLFOX; any note stating “no adjuvant therapy planned” for an eligible patient is flagged.
Pipeline Highlights:
– NER extracts (TUMOR_TYPE = “Colon adenocarcinoma”) and (STAGE = “Stage III”).
– Relation Extraction maps (TUMOR_TYPE → STAGE).
– NER also captures (TREATMENT_PLAN = “no adjuvant therapy”).
– A rule-based filter:
IF tumor_type CONTAINS "colon" AND stage = "Stage III" AND treatment_plan CONTAINS "no adjuvant" THEN flag_for_QI
– QI team reviews only flagged cases rather than scanning hundreds of charts.
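Expressed as a small Python predicate over the flattened record (field names reuse the illustrative schema from earlier):

```python
def flag_for_qi(record: dict) -> bool:
    """Flag Stage III colon cancer records with no adjuvant therapy planned."""
    tumor = (record.get("tumor_type") or "").lower()
    plan = (record.get("treatment_plan") or "").lower()
    return ("colon" in tumor
            and record.get("stage") == "Stage III"
            and "no adjuvant" in plan)

flag_for_qi({"tumor_type": "Colon adenocarcinoma", "stage": "Stage III",
             "treatment_plan": "no adjuvant therapy"})  # -> True
```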
4. Adverse Event & Toxicity Surveillance
Context: Oncology practices need to capture chemotherapy toxicities documented in progress notes, like “Grade 3 neutropenia,” “febrile neutropenia,” or “Grade 2 peripheral neuropathy,” to update patient management and report safety signals.
Pipeline Highlights:
– Oncology NER identifies (AE_TYPE = “neutropenia”) and (AE_GRADE = “Grade 3”).
– Relation Extraction chains them into (AE_TYPE, AE_GRADE) pairs.
– Assertion Classification ensures “no evidence of neutropenia” is not counted.
– Aggregated AE profiles trigger alerts if repeated high-grade events occur—alerting clinicians to adjust therapy.
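A minimal alerting rule over extracted (AE_TYPE, AE_GRADE) pairs might look like this (the thresholds are hypothetical and would be set by clinical policy):

```python
from collections import Counter

GRADE_THRESHOLD = 3   # "high grade" cutoff (hypothetical)
REPEAT_THRESHOLD = 2  # alert after this many high-grade events (hypothetical)

def ae_alerts(ae_pairs: list[tuple[str, int]]) -> list[str]:
    """Return AE types with repeated high-grade events for one patient."""
    high_grade = Counter(ae for ae, grade in ae_pairs if grade >= GRADE_THRESHOLD)
    return [ae for ae, count in high_grade.items() if count >= REPEAT_THRESHOLD]

# Two Grade >= 3 neutropenia events across visits triggers an alert
print(ae_alerts([("neutropenia", 3), ("neuropathy", 2), ("neutropenia", 4)]))
# -> ['neutropenia']
```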
5. Biomarker-Driven Outcomes Research
Context: A research group studies outcomes for “ALK-positive lung cancer” vs. “EGFR-mutated lung cancer.” They need to stratify historical cohorts by mutation status and treatment regimens.
Pipeline Highlights:
– NER + Normalization extract (BIOMARKER = “ALK fusion”) or (BIOMARKER = “EGFR L858R”) from genomic/pathology reports.
– Relation Extraction ties (BIOMARKER → TUMOR_ID), labeling each patient’s tumor record.
– Downstream Analytics join structured NER results with EHR demographics, therapy start dates, and survival outcomes to generate Kaplan–Meier curves by cohort.
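For the survival analysis itself, the lifelines package is one common choice; here is a sketch with a toy joined table (in practice, the table comes from joining NER output with EHR outcomes):

```python
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Toy joined table; real rows come from NER results + EHR outcomes
df = pd.DataFrame({
    "biomarker": ["ALK fusion", "ALK fusion", "EGFR L858R", "EGFR L858R"],
    "survival_months": [24.0, 30.5, 18.2, 22.7],
    "event": [1, 0, 1, 1],  # 1 = death observed, 0 = censored
})

kmf = KaplanMeierFitter()
ax = None
for biomarker, cohort in df.groupby("biomarker"):
    kmf.fit(cohort["survival_months"], cohort["event"], label=biomarker)
    ax = kmf.plot_survival_function(ax=ax)  # overlay cohorts on one axis
plt.xlabel("Months from therapy start")
plt.show()
```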
Comparing Specialized Medical Language Models vs. LLM-Only Approaches
Recently, there’s been growing interest in using LLMs like GPT-4 for zero-shot oncology IE—prompting with “Extract tumor type, stage, biomarkers, and treatment.” However, benchmarks show LLMs alone lag behind specialized pipelines. Key comparisons:
1. Zero-Shot LLMs Struggle with Complex Entity/Relation Extraction
• On breast cancer histopathology reports, GPT-4 zero-shot achieved 0.51 F1 vs. human annotator 0.68 F1 (Varlamova et al., “Leveraging large language models for structured information extraction from pathology reports”).
• Specialized models on CACER achieved 0.88 F1 on event extraction and 0.65 F1 on relation extraction vs. GPT-4’s <0.50 F1 (Fu et al., “CACER: Clinical Concept Annotations for Cancer Events and Relations”).
2. Entity Granularity & Ontology Mapping
• LLMs identify “breast carcinoma” but falter on subtypes like “grade 3 invasive ductal carcinoma.”
• LLMs map to ontologies inconsistently; Medical Language Models (MLMs) map to SNOMED CT/ICD-O with >95% precision.
3. Assertion/Negation & Section-Aware Extraction
• LLMs miss nested negations (“no evidence of metastatic disease”).
• MLM’s assertion classifiers handle negation with >90% accuracy.
4. Relation Extraction Stability
• LLMs hallucinate relations when outputting JSON triples.
• MLM relation models yield 0.80 F1 vs. GPT-4’s <0.60 F1 (Jabal et al., “Language Models and Retrieval Augmented Generation for Automated Structured Data Extraction from Diagnostic Reports”).
5. Consistency & Auditing
• Regulators require deterministic, reproducible outputs. LLM outputs vary by run.
• MLM pipelines produce versioned, auditable annotations.
6. Performance on Large-Scale Datasets
• GPT-4 achieved ~82% accuracy on Japanese lung cancer radiology reports; recall dropped on rare variants.
• MLM models achieved >94% accuracy on the same dataset (Fu et al., “CACER: Clinical Concept Annotations for Cancer Events and Relations”).
In summary, LLMs alone are not reliable enough for regulatory-grade oncology IE. MLMs combining NER, assertion detection, relation extraction, and ontology mapping consistently outperform LLMs.
Incorporating John Snow Labs’ Medical LLM for Customization and Reasoning
John Snow Labs’ medical LLMs can be integrated to:
– Enable Custom, Prompt-Based Extraction: For novel entities, the LLM can be prompted rather than retraining.
– Provide Inference & Reasoning: Infer implicit fields by reasoning across sentences (e.g., performance status).
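As an illustration of the prompt-based pattern (the template below is a sketch; how the Medical LLM is invoked depends on your deployment, so no client call is shown):

```python
# Illustrative prompt for a field with no pre-trained extractor
PROMPT_TEMPLATE = """You are extracting oncology data from a clinical note.
Return JSON with exactly these keys:
  - "biomarker_name": the novel biomarker mentioned, or null
  - "biomarker_status": "positive", "negative", or "unknown"

Note:
{note_text}
"""

def build_extraction_prompt(note_text: str) -> str:
    """Fill the template; send the result to the Medical LLM endpoint you use."""
    return PROMPT_TEMPLATE.format(note_text=note_text)
```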
Best Practices for Implementing an Oncology IE Pipeline
1. Start with a Representative, Labeled Corpus: Build a small gold standard (100–200 notes), then fine-tune on local data to boost precision and recall.
2. Segment Clinical Documents Early: Use section splitting to isolate “Diagnosis” and “Treatment Plan” blocks.
3. Chain Entity & Relation Models Thoughtfully: Run NER first, then relation extraction on annotated spans.
4. Leverage Assertion Detection to Filter True Positives: Exclude negated or historical mentions.
5. Normalize to Standard Ontologies: Map to SNOMED CT, ICD-O, or UMLS for interoperability.
6. Build Monitoring & Feedback Loops: Track metrics, sample for manual review, and fine-tune models over time.
7. Ensure Compliance & Security: Run in a HIPAA-compliant environment with role-based access control.
Conclusion
Accurate oncology IE underpins clinical decision support, trial recruitment, registries, quality monitoring, and research. While LLMs like GPT-4 offer a zero-shot starting point, they underperform on domain-specific tasks (entity/relation extraction, negation, ontology mapping). John Snow Labs’ Medical Language Models deliver specialized, scalable solutions for oncology narratives.
By leveraging pre-trained oncology NER, assertion detection, relation extraction, integrated ontology mapping, and optionally integrating medical LLMs for custom extraction and reasoning, teams can transform unstructured oncology notes into highly accurate, auditable datasets.
Whether building a cancer registry, identifying biomarker-driven cohorts, or optimizing EHR data for decision support, Medical Language Models provide the precision, recall, and reproducibility required for regulatory-grade applications.
Disclaimer
John Snow Labs’ Medical Language Models are provided for research and development. Always validate extracted data against a gold-standard annotation set before using in clinical decision-making or regulatory submissions.