AI-Enhanced Oncology Data: Unlocking Insights from EHRs with NLP and LLMs

10.06.2025

Julio Bonis

Data Scientist at John Snow Labs

Introduction

Oncology data is inherently complex, dispersed, and often unstructured. Extracting actionable insights from Electronic Health Records (EHRs) and other clinical data sources poses a significant challenge for healthcare professionals. Traditional manual extraction methods are time-consuming, labor-intensive, and prone to errors, especially when dealing with massive datasets.

John Snow Labs and MiBA are at the forefront of addressing these challenges. By integrating Natural Language Processing (NLP) and Large Language Models (LLMs), they are unlocking valuable insights from oncology EHRs. This post explores how AI-enhanced data models can improve oncology data accuracy, streamline clinical trial matching, and enhance adverse event detection.

The Problem: Complex and Dispersed Oncology Data

Oncology data is often fragmented and complex, scattered across various systems in both structured and unstructured formats. Important clinical information, such as diagnosis, histology, biomarkers, and treatment responses, is frequently embedded in unstructured data like notes, PDFs, and scanned documents. At the same time, structured fields in EHRs are often incomplete or missing crucial details. This creates significant challenges in creating a comprehensive, cohesive patient profile.

Manual data extraction from these diverse sources presents additional complications. The process of reviewing thousands of patient notes is impractical, error-prone, and inefficient. Relying on manual methods leads to inconsistent data, which can leave parts of the patient journey unrecorded. When dealing with data from hundreds of thousands of patients, it’s clear that such an approach quickly becomes unsustainable without automation.

The lack of comprehensive data extraction can lead to missed opportunities for valuable insights. Critical information about disease progression, therapy outcomes, and patient responses may go unnoticed. This, in turn, makes it much harder to identify adverse events related to treatments or to match patients with relevant clinical trials, which can significantly impact clinical decision-making and patient care.

AI-Enhanced Oncology Data Model

The Role of John Snow Labs and MiBA

MiBA’s mission is to unlock the power of data to fuel innovation and improve patient care. It uses AI to generate insights efficiently and aims to become the premier source for oncology research, clinical trials, and new developments.

John Snow Labs, in collaboration with MiBA, developed a robust AI pipeline to tackle oncology data complexity. By leveraging NLP and LLMs, they extract, structure, and analyze unstructured data from EHRs, PDFs, and clinical notes.

Key Features:

Entity Extraction: Identifies essential clinical entities such as diagnosis, biomarkers, and therapy responses.
Relationship Mapping: Establishes connections between entities (e.g., drug and adverse event relationships).
Data Integration: Combines extracted data into a unified, structured model for analysis.
NLP/LLM Hybrid Approaches: Combines NLP efficiency with LLM reasoning for improved accuracy.

Step 1: Data Ingestion and Preprocessing

The pipeline ingests data from diverse sources: structured EHR data (e.g., diagnosis codes, lab results) and unstructured data from clinical notes, PDFs, and scanned documents. Optical Character Recognition (OCR) is used to extract text from scanned records.

Technology Stack:

John Snow Labs Clinical NLP Tools: For annotation and preprocessing
Azure SQL and Spark Pools: Manage large-scale data processing
MiBA Platform: Integrates structured and unstructured data, providing nightly updates for accuracy.

Step 2: Entity Extraction and Relationship Mapping

The pipeline uses pre-trained models from John Snow Labs to extract entities and relationships, achieving an average F1 score of 0.9 for entity relationships.

Named Entity Recognition (NER): Identifies clinical elements like histology, biomarkers, and therapy responses.
Relationship Mapping: Establishes links between diagnosis and therapy outcomes using NLP models tuned for biomedical texts.
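
To make the two stages concrete, here is a toy sketch in plain Python. It does not use John Snow Labs models: the tiny lexicon, the labels, and the same-sentence pairing rule are all illustrative assumptions standing in for trained NER and relation-extraction models.

```python
import re
from itertools import product

# Hypothetical mini-lexicon; a real pipeline would use trained clinical NER models.
LEXICON = {
    "DRUG": ["alectinib", "tamoxifen"],
    "ADVERSE_EVENT": ["bradycardia", "nausea"],
    "BIOMARKER": ["braf", "estrogen receptor"],
}

def extract_entities(sentence):
    """Return (label, matched_text, start_offset) for each lexicon term found."""
    found = []
    for label, terms in LEXICON.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, sentence, re.IGNORECASE):
                found.append((label, m.group(0), m.start()))
    return sorted(found, key=lambda e: e[2])

def link_drug_to_events(text):
    """Pair each drug with each adverse event mentioned in the same sentence.

    Same-sentence co-occurrence is a deliberately naive stand-in for a
    trained relation-extraction model.
    """
    pairs = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        entities = extract_entities(sentence)
        drugs = [e[1] for e in entities if e[0] == "DRUG"]
        events = [e[1] for e in entities if e[0] == "ADVERSE_EVENT"]
        pairs.extend(product(drugs, events))
    return pairs

note = ("Patient started Alectinib last month. "
        "Bradycardia noted after Alectinib dose increase.")
relations = link_drug_to_events(note)
print(relations)
```

Only the second sentence mentions both a drug and an adverse event, so it is the only one that yields a pair. A lexicon lookup like this misses negation, misspellings, and cross-sentence links, which is where trained models earn their keep.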

Step 3: Data Structuring and Storage

The data structuring and storage process brings together information from both EHR fields and extracted unstructured data to create a unified and comprehensive dataset. This integration ensures a more complete representation of each patient’s clinical profile. To maintain accuracy, the dataset is updated nightly, reflecting the most current information available.
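
A minimal sketch of this kind of merge is shown below. The field names and the precedence rule (a populated structured EHR value wins, and NLP-extracted values only fill gaps) are assumptions for illustration; the post does not specify how conflicts between the two sources are resolved.

```python
def merge_patient_record(structured, extracted):
    """Fill gaps in structured EHR fields with NLP-extracted values.

    Assumed rule: a non-empty structured value takes precedence; otherwise
    the extracted value is used. Provenance is recorded per field.
    """
    merged, provenance = {}, {}
    for field in sorted(set(structured) | set(extracted)):
        ehr_value = structured.get(field)
        nlp_value = extracted.get(field)
        if ehr_value not in (None, ""):
            merged[field], provenance[field] = ehr_value, "ehr"
        elif nlp_value not in (None, ""):
            merged[field], provenance[field] = nlp_value, "nlp"
        else:
            merged[field], provenance[field] = None, "missing"
    return merged, provenance

# Hypothetical patient: stage and BRAF status appear only in free-text notes.
structured = {"diagnosis": "melanoma", "stage": None, "histology": ""}
extracted = {"stage": "III", "braf_mutation": "positive"}

merged, provenance = merge_patient_record(structured, extracted)
print(merged)
print(provenance)
```

Keeping a provenance map alongside the merged record makes it possible to audit which values came from notes rather than coded fields.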

The impact of integrating NLP tools is significant, leading to substantial increases in key clinical data elements. After NLP integration, histology data increases by 67.5%, stage data by 19.5%, metastasis information by 39.9%, estrogen receptor data by 34.9%, and BRAF mutation data by 81.5%. These gains highlight the effectiveness of combining structured and unstructured data in improving dataset completeness and utility.

Real-World Applications

Clinical Trial Matching

Clinical trials are crucial for oncology patients, yet enrollment rates are often low, particularly in community settings. Identifying eligible patients within vast and unstructured data pools is challenging.

The integration of NLP and LLMs for clinical trial matching enhances both the accuracy and efficiency of identifying eligible patients. This system automatically extracts inclusion and exclusion criteria from clinical trial databases and harmonizes them with patient data to ensure precise matching.
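
Once criteria and patient data share a common vocabulary, the matching step itself can be quite simple. The sketch below assumes criteria have already been reduced to attribute/value pairs; the `Trial` class, attribute names, and trial IDs are hypothetical, and real eligibility criteria (ranges, time windows, free-text conditions) are considerably richer.

```python
from dataclasses import dataclass, field

@dataclass
class Trial:
    trial_id: str
    inclusion: dict = field(default_factory=dict)  # attribute -> required value
    exclusion: dict = field(default_factory=dict)  # attribute -> disqualifying value

def is_eligible(patient, trial):
    """Eligible if every inclusion criterion is met and no exclusion criterion is.

    A missing patient attribute fails an inclusion check and does not
    trigger an exclusion check (an assumption; a real system might
    flag such cases for review instead).
    """
    meets_inclusion = all(patient.get(k) == v for k, v in trial.inclusion.items())
    hits_exclusion = any(patient.get(k) == v for k, v in trial.exclusion.items())
    return meets_inclusion and not hits_exclusion

def match_trials(patient, trials):
    """Return the IDs of all trials the patient qualifies for."""
    return [t.trial_id for t in trials if is_eligible(patient, t)]

patient = {"diagnosis": "melanoma", "braf_mutation": "positive", "brain_metastasis": False}
trials = [
    Trial("TRIAL-A", {"diagnosis": "melanoma", "braf_mutation": "positive"},
          {"brain_metastasis": True}),
    Trial("TRIAL-B", {"diagnosis": "breast cancer"}),
    Trial("TRIAL-C", {"diagnosis": "melanoma"}, {"braf_mutation": "positive"}),
]

eligible = match_trials(patient, trials)
print(eligible)
```

The patient fails TRIAL-B on inclusion and TRIAL-C on exclusion, leaving TRIAL-A as the only match.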

This approach has demonstrated a significant improvement in performance. The NLP/LLM hybrid model achieves an F1 score of 0.81, markedly outperforming traditional methods. These results underscore the value of advanced language technologies in streamlining and optimizing the trial matching process.

Adverse Event Detection

Detecting adverse events (AEs) related to oncology therapies is critical for ensuring patient safety. However, the task is complicated by the diverse language and contextual variations found in clinical documentation, which often obscure clear identification of AE-drug relationships.

To address this challenge, a hybrid NLP/LLM pipeline has been developed, capable of accurately identifying these relationships with a high level of precision and recall. This system achieves an F1 score of 0.93, with a recall of 0.95 and a precision of 0.91, demonstrating its effectiveness in recognizing subtle and complex patterns in clinical text. For example, the model can accurately detect that Alectinib is associated with the adverse event of bradycardia, illustrating its capability to capture clinically meaningful insights that might otherwise be missed.
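
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so the three reported figures can be verified against each other. The function name below is just for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported for adverse event detection.
ae_f1 = f1_score(precision=0.91, recall=0.95)
print(round(ae_f1, 2))
```

Rounded to two decimals, this agrees with the reported F1 of 0.93.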

Advantages of the AI-Enhanced Approach

Automating data extraction improves data accuracy and coverage while significantly reducing the need for manual review. It captures data elements that are often missing from structured fields, enriching some elements by roughly 80% (BRAF mutation data, for example, increased by 81.5%) and providing a more complete view of the patient journey.

Clinical trial enrollment is also enhanced through faster and more accurate identification of eligible patients. This streamlined process helps improve enrollment rates and reduces the likelihood of missing suitable candidates, ultimately accelerating trial timelines.

In addition, the system supports proactive adverse event management by automating the detection of therapy-related adverse events. This not only ensures better patient safety but also integrates smoothly into existing clinical workflows, enabling more efficient and timely responses.

Addressing Challenges

To control hallucinations in language model responses, the system leverages Retrieval Augmented Generation (RAG), which ensures that responses are grounded in real, relevant data. When the system retrieves insufficient data for a reliable output, it prompts the user, alerting them to the limitations in the available information.
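
The gating behavior described here can be sketched as follows. This is a toy illustration, not the production system: the word-overlap retriever, the thresholds, and the response format are all assumptions, and the actual LLM generation step is omitted.

```python
import re

def tokenize(text):
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, corpus, min_overlap=2):
    """Toy retriever: keep passages sharing at least `min_overlap` words with the query."""
    query_terms = tokenize(query)
    scored = []
    for passage in corpus:
        overlap = len(query_terms & tokenize(passage))
        if overlap >= min_overlap:
            scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored]

def answer_or_flag(query, corpus, min_passages=1):
    """Return grounding context, or flag the query when too little was retrieved."""
    passages = retrieve(query, corpus)
    if len(passages) < min_passages:
        return {
            "status": "insufficient_evidence",
            "message": "Not enough supporting records were found to answer reliably.",
            "context": [],
        }
    # A real system would pass `passages` to an LLM as grounding context here.
    return {"status": "ok", "context": passages}

corpus = [
    "Patient developed bradycardia after starting alectinib.",
    "Histology confirmed adenocarcinoma.",
]
grounded = answer_or_flag("bradycardia after alectinib", corpus)
flagged = answer_or_flag("renal toxicity with cisplatin", corpus)
print(grounded["status"], flagged["status"])
```

The key design point is that the check happens before generation: when retrieval comes back empty or thin, the user sees an explicit warning rather than a fluent but unsupported answer.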

For integrating unstructured data, hybrid models combining NLP and LLMs provide a balanced approach. This combination ensures both computational efficiency and contextual accuracy, allowing the system to process and understand data from a variety of sources.
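
The post does not describe how the NLP and LLM components are combined. One common pattern, sketched below purely as an assumption, is confidence-based routing: a fast NLP model handles the cases it is sure about, and only low-confidence inputs are escalated to the slower LLM. Both extractors here are stubs, and the threshold is arbitrary.

```python
def hybrid_extract(text, fast_extractor, llm_extractor, threshold=0.8):
    """Use the cheap extractor first; call the LLM only for low-confidence cases.

    `fast_extractor` returns (label, confidence); `llm_extractor` returns a label.
    Returns (label, source) so downstream code can see which path was taken.
    """
    label, confidence = fast_extractor(text)
    if confidence >= threshold:
        return label, "nlp"
    return llm_extractor(text), "llm"

# Stub extractors standing in for real models (hypothetical behavior).
def stub_fast(text):
    if "bradycardia" in text.lower():
        return "bradycardia", 0.95
    return "unknown", 0.2

def stub_llm(text):
    return "needs_review"

clear_case = hybrid_extract("Bradycardia noted on exam.", stub_fast, stub_llm)
hard_case = hybrid_extract("Pulse has been on the slow side lately.", stub_fast, stub_llm)
print(clear_case, hard_case)
```

Under this pattern the expensive model is invoked only for the minority of ambiguous inputs, which is one way to get both efficiency and contextual accuracy.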

In terms of compliance and data privacy, the system operates within HIPAA-compliant AWS and Azure environments, ensuring that all processes adhere to strict regulations. Additionally, it employs data anonymization techniques to maintain confidentiality and protect patient information. For reference, the system follows the guidelines outlined in John Snow Labs’ Compliance Documentation.

Future Directions

John Snow Labs and MiBA are working together to further enhance the capabilities of the AI-powered oncology data model. Their collaboration focuses on improving scalability and computational efficiency to support larger datasets and more complex analyses. In addition, they aim to deepen the integration of real-time data analytics to provide timely and actionable insights for clinical decision support.

Looking ahead, the partnership also seeks to broaden the application of LLMs across other oncology subfields, including rare and less-studied cancer types, thereby expanding the model’s utility and impact across a wider range of clinical scenarios.

Conclusion

By leveraging NLP and LLMs, John Snow Labs and MiBA are revolutionizing oncology data management. Automating data extraction and improving accuracy empower healthcare professionals to make data-driven decisions, ultimately enhancing patient care and research outcomes.

For more information, visit John Snow Labs.

Julio Bonis is a data scientist working on Healthcare NLP at John Snow Labs. Julio has broad experience in software development and design of complex data products within the scope of Real World Evidence (RWE) and Natural Language Processing (NLP). He also has substantial clinical and management experience – including entrepreneurship and Medical Affairs. Julio is a medical doctor specialized in Family Medicine (registered GP), has an Executive MBA – IESE, an MSc in Bioinformatics, and an MSc in Epidemiology.