
Multilingual Clinical NER with ONNX: New Models for Entity Extraction Across Languages

Data Scientist at John Snow Labs

John Snow Labs is thrilled to introduce a set of new ONNX-based clinical Named Entity Recognition (NER) models for English, Italian, and Spanish in its most recent release, expanding multilingual support beyond English for structured information extraction from clinical text. These models are designed to identify core medical entities, including diseases, procedures, medications, and symptoms, with high accuracy and production-ready performance.

Building on our robust English foundation, these models expand our clinical NLP toolkit to support Italian and Spanish, making advanced entity extraction accessible for global teams.

Healthcare NLP

Healthcare NLP (Natural Language Processing) technologies extract insights from physician records, patient histories, and diagnostic reports, supporting more precise diagnosis.

Being the most widely used library in the healthcare industry, John Snow Labs’ Healthcare NLP comes with 2,500+ pretrained models, all developed and trained with the latest state-of-the-art algorithms to solve real-world problems in the healthcare domain at scale. To keep models and tools reliable, cover edge cases in real-world data, and improve how well models generalize, the datasets and models are updated and augmented on a regular basis.

Clinical NER in John Snow Labs

Named Entity Recognition (NER) is a Natural Language Processing (NLP) technique that involves identifying and extracting entities from a text, such as people, organizations, locations, dates, and other types of named entities.

Spark NLP pipelines

Within the John Snow Labs ecosystem, clinical NER is implemented using transformer-based models integrated into scalable NLP pipelines. These pipelines handle document processing, tokenization, embeddings, and entity extraction in a modular way, making it possible to deploy NER consistently across batch and real-time workloads.

For readers who want to explore how clinical NER works in practice without diving into model internals, the following resource provides a practical overview with examples and pipelines:

Why ONNX?

ONNX (Open Neural Network Exchange) is a standardized format for representing machine learning models across frameworks and runtimes. In the context of clinical NLP, ONNX can enable faster inference, reduced resource usage, and easier deployment across different environments, including cloud, on-premises, and edge systems.

Available Models by Language

 

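+---------------------------------------------+----------+--------------------+
| Model Name                                  | Language | Entities Extracted |
+---------------------------------------------+----------+--------------------+
| roberta_med_ner_onnx                        | English  | MEDICATION         |
| roberta_disease_ner_onnx                    | English  | DISEASE            |
| roberta_procedure_ner_onnx                  | English  | PROCEDURE          |
| roberta_symptom_ner_onnx                    | English  | SYMPTOM            |
| bert_token_classifier_medical_ner_it_onnx   | Italian  | MEDICATION         |
| bert_token_classifier_disease_ner_it_onnx   | Italian  | DISEASE            |
| bert_token_classifier_procedure_ner_it_onnx | Italian  | PROCEDURE          |
| roberta_disease_ner_es_onnx                 | Spanish  | DISEASE            |
| roberta_procedure_ner_es_onnx               | Spanish  | PROCEDURE          |
| roberta_symptom_ner_es_onnx                 | Spanish  | SYMPTOM            |
+---------------------------------------------+----------+--------------------+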

 

These models are built on transformer architectures such as BERT, RoBERTa, and XLM-RoBERTa, and are fine-tuned specifically for clinical and biomedical text. This allows them to handle domain-specific terminology, abbreviations, and language-specific clinical phrasing commonly found in real-world medical narratives.
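Since each model covers exactly one language and one entity type, it can help to keep the table above as a small lookup when choosing which model to load. The following is a minimal sketch in plain Python: the model names and entity labels are copied from the table, while the registry layout, the helper names `model_for` and `missing_entities`, and the `"es"` language code are illustrative assumptions and not part of the John Snow Labs API.

```python
# Model names and entity labels are copied from the table above.
# The (language, entity) keys, the "es" code, and both helpers are
# assumptions made for illustration only.
MODELS = {
    ("en", "MEDICATION"): "roberta_med_ner_onnx",
    ("en", "DISEASE"): "roberta_disease_ner_onnx",
    ("en", "PROCEDURE"): "roberta_procedure_ner_onnx",
    ("en", "SYMPTOM"): "roberta_symptom_ner_onnx",
    ("it", "MEDICATION"): "bert_token_classifier_medical_ner_it_onnx",
    ("it", "DISEASE"): "bert_token_classifier_disease_ner_it_onnx",
    ("it", "PROCEDURE"): "bert_token_classifier_procedure_ner_it_onnx",
    ("es", "DISEASE"): "roberta_disease_ner_es_onnx",
    ("es", "PROCEDURE"): "roberta_procedure_ner_es_onnx",
    ("es", "SYMPTOM"): "roberta_symptom_ner_es_onnx",
}

ENTITIES = ("MEDICATION", "DISEASE", "PROCEDURE", "SYMPTOM")


def model_for(lang, entity):
    """Return the listed model name for a language/entity pair, or None."""
    return MODELS.get((lang, entity))


def missing_entities(lang):
    """Return the entity types that have no listed model for this language."""
    return [entity for entity in ENTITIES if (lang, entity) not in MODELS]


print(model_for("es", "DISEASE"))
print(missing_entities("it"))
print(missing_entities("es"))
```

A lookup like this also makes the coverage gaps explicit: in the table as listed, Italian has no SYMPTOM model and Spanish has no MEDICATION model.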

Multilingual Clinical Extraction

Figure: Clinical Named Entity Recognition powered by ONNX Runtime.

As demonstrated in the image above, the models are able to accurately parse and structure complex medical narratives across multiple categories.

Medical conditions such as “type 2 diabetes mellitus” and “arterial hypertension” are labeled as DISEASE, clinical interventions like “percutaneous coronary intervention” are recognized as PROCEDURE, and prescribed drugs including “metformin”, “insulin glargine”, and “aspirin” are identified as MEDICATION.
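Downstream code often wants these chunks grouped by label and not as a flat list. The sketch below shows one way to do that in plain Python, assuming the extracted chunks have already been collected as (chunk, entity) pairs; `group_by_entity` is a hypothetical helper, and the sample pairs are the examples quoted in the paragraph above.

```python
from collections import defaultdict


def group_by_entity(pairs):
    """Group (chunk, entity) pairs into {entity: [chunk, ...]}, keeping input order."""
    grouped = defaultdict(list)
    for chunk, entity in pairs:
        grouped[entity].append(chunk)
    return dict(grouped)


# Example pairs taken from the entities described in the text above.
pairs = [
    ("type 2 diabetes mellitus", "DISEASE"),
    ("arterial hypertension", "DISEASE"),
    ("percutaneous coronary intervention", "PROCEDURE"),
    ("metformin", "MEDICATION"),
    ("insulin glargine", "MEDICATION"),
    ("aspirin", "MEDICATION"),
]

grouped = group_by_entity(pairs)
for entity, chunks in grouped.items():
    print(entity, chunks)
```

Because each model in this release emits a single entity type, a structure like this is a convenient place to merge the outputs of several models run over the same text.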

Usage examples

English Model

from johnsnowlabs import nlp, medical

# Start Spark session
spark = nlp.start()

# Build the pipeline
document_assembler = nlp.DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = nlp.Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

token_classifier = nlp.RoBertaForTokenClassification \
    .pretrained("roberta_med_ner_onnx", "en", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

converter = nlp.NerConverter() \
    .setInputCols(["document", "token", "ner"]) \
    .setOutputCol("ner_chunk")

pipeline = nlp.Pipeline(stages=[
    document_assembler, tokenizer,
    token_classifier, converter
])

# Sample clinical text
sample_text = """The patient was diagnosed with type 2 diabetes mellitus and arterial hypertension.
She underwent percutaneous coronary intervention last year. Current medications include
metformin 500mg twice daily, insulin glargine 20 units at bedtime, and aspirin 81mg daily.
She reports occasional chest pain and shortness of breath."""

data = spark.createDataFrame([[sample_text]]).toDF("text")
result = pipeline.fit(data).transform(data)

Results:

result.selectExpr("explode(ner_chunk) as chunk").selectExpr(
    "chunk.result as chunk",
    "chunk.metadata['entity'] as entity"
).show(truncate=False)

+--------------------------------------+-----------+
| chunk                                | entity    |
+--------------------------------------+-----------+
| metformin                            | MEDICATION|
| insulin glargine                     | MEDICATION|
| aspirin                              | MEDICATION|
+--------------------------------------+-----------+

Spanish Model

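# Note: the model name and language code below ("it") correspond to the Italian
# MEDICATION model in the table above; the sample text that follows is Spanish.
spanish_token_classifier = medical.MedicalBertForTokenClassifier \
    .pretrained("bert_token_classifier_medical_ner_it_onnx", "it", "clinical/models") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")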

pipeline_es = nlp.Pipeline(stages=[
    document_assembler, tokenizer,
    spanish_token_classifier, converter
])

sample_text_es = """El paciente fue diagnosticado con diabetes mellitus tipo 2 e hipertensión arterial.
Se le realizó una intervención coronaria percutánea el año pasado. Los medicamentos
actuales incluyen metformina 500mg dos veces al día, insulina glargina 20 unidades
al acostarse y aspirina 81mg diaria. Refiere dolor torácico ocasional y disnea."""

data_es = spark.createDataFrame([[sample_text_es]]).toDF("text")
result_es = pipeline_es.fit(data_es).transform(data_es)

Results:

result_es.selectExpr("explode(ner_chunk) as chunk").selectExpr(
    "chunk.result as chunk",
    "chunk.metadata['entity'] as entity"
).show(truncate=False)

+--------------------------------------+-----------+
| chunk                                | entity    |
+--------------------------------------+-----------+
| metformina                           | MEDICATION|
| insulina glargina                    | MEDICATION|
| aspirina                             | MEDICATION|
+--------------------------------------+-----------+

Stay tuned for upcoming releases filling the noted gaps and expanding to more languages!

For additional information, please consult the following references:

Conclusion

The introduction of multilingual clinical NER models marks a significant milestone in John Snow Labs’ mission to globalize medical data extraction. By extending high-performing entity extraction capabilities across languages, these models bridge a critical gap in non-English clinical NLP, ensuring that healthcare providers and researchers worldwide can leverage structured insights from localized medical records.


Healthcare NLP Release Notes

6.3
Our additional expert:
Data Scientist at John Snow Labs. Working with NLP, machine learning, and multimodal AI systems, he builds and optimizes healthcare-focused solutions for real-world data and production environments.
