High Accuracy Resolution of Medical Entities to Standard Codes Using Novel Sentence Embeddings

11.06.2021

Veysel Kocaman

The release of Spark NLP for Healthcare 3.1 brings significant speed and accuracy improvements to entity resolution, also known as entity linking: the ability to map a medical entity to a standard code. This release supports the SNOMED-CT, ICD-10-CM, ICD-10-PCS, CPT, LOINC, RxNorm, UMLS, HPO, and ICD-O terminologies. Several customers have already validated the accuracy gains on real-world data.

New John Snow Labs SBert Sentence Embeddings

One challenge with resolving medical entities to codes is that very often, multiple similar codes exist for a term, and the most appropriate one depends on context. For example, “bladder cancer” may be mapped to any of the following ICD-10-CM standard terms:

Cancer in situ of urinary bladder
Carcinoma in situ of bladder
Tumor of bladder neck
Neoplasm of unspecified behavior of bladder
Malignant tumour of bladder neck
Secondary malignant neoplasm of bladder
Malignant tumor of urinary bladder
Malignant neoplasm of bladder, unspecified

Ranking these options, or picking the most relevant one, depends heavily on context. To capture medical context better, we’ve developed a set of new healthcare-specific sentence embeddings, including the first medical sentence embeddings available in different sizes. These embeddings (and the entity resolution models that leverage them) are not available anywhere else, and they deliver better accuracy than older embeddings such as BioBERT and ClinicalBERT, for three reasons:

They’re based on a newer deep learning architecture (SBERT).

We augmented the training data beyond what’s available in public academic datasets.

They’re more current, because we just retrained them. In contrast, for example, BioBERT was trained in 2019 – before any mention of COVID-19 on PubMed.

The new SBERT models delivered with the 3.1 release are fine-tuned on the MedNLI, NLI, and UMLS datasets with various parameters to cover common NLP tasks in the medical domain:

sbiobert_jsl_cased
sbiobert_jsl_umls_cased
sbert_jsl_medium_uncased
sbert_jsl_medium_umls_uncased
sbert_jsl_mini_uncased
sbert_jsl_mini_umls_uncased
sbert_jsl_tiny_uncased
sbert_jsl_tiny_umls_uncased

6X Faster Load Times for Sentence Resolver Models

Sentence resolver models now load faster, with an average six-fold speedup over previous versions. The load process is also more memory-friendly: peak memory during loading is lower, which reduces the chance of out-of-memory exceptions and relaxes hardware requirements.

John Snow Labs SBert Model Speed Benchmark

Model	Base Model	Is Cased	Train Datasets	Inference speed (100 rows)
sbiobert_jsl_cased	biobert_v1.1_pubmed	Cased	medNLI, allNLI	274.53
sbiobert_jsl_umls_cased	biobert_v1.1_pubmed	Cased	medNLI, allNLI, umls	274.52
sbert_jsl_medium_uncased	uncased_L-8_H-512_A-8	Uncased	medNLI, allNLI	80.40
sbert_jsl_medium_umls_uncased	uncased_L-8_H-512_A-8	Uncased	medNLI, allNLI, umls	78.35
sbert_jsl_mini_uncased	uncased_L-4_H-256_A-4	Uncased	medNLI, allNLI	10.68
sbert_jsl_mini_umls_uncased	uncased_L-4_H-256_A-4	Uncased	medNLI, allNLI, umls	10.29
sbert_jsl_tiny_uncased	uncased_L-2_H-128_A-2	Uncased	medNLI, allNLI	4.54
sbert_jsl_tiny_umls_uncased	uncased_L-2_H-128_A-2	Uncased	medNLI, allNLI, umls	4.54
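
The trade-off between model sizes is easier to see as a ratio. The sketch below is illustrative only: it copies four timings from the table and assumes the last column is elapsed time (lower is faster), since the table does not state a unit. The helper name speedup_over is made up for this example.

```python
# Timings transcribed from the benchmark table (100 rows each).
# Assumption: the column reports elapsed time, so lower is faster.
timings = {
    "sbiobert_jsl_cased": 274.53,
    "sbert_jsl_medium_uncased": 80.40,
    "sbert_jsl_mini_uncased": 10.68,
    "sbert_jsl_tiny_uncased": 4.54,
}


def speedup_over(baseline, timings):
    """Return how many times faster each model is than the baseline model."""
    base = timings[baseline]
    return {name: base / value for name, value in timings.items()}


speedups = speedup_over("sbiobert_jsl_cased", timings)

# Print the models from slowest to fastest relative to the BioBERT-based one.
for name, ratio in sorted(speedups.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.1f}x")
```

Under that assumption, the smaller models trade embedding capacity for a large reduction in inference time, which is worth weighing against the accuracy needs of a given pipeline.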

Higher Accuracy ICD-10-CM Resolver Models

These models map clinical entities and concepts to ICD-10-CM codes using SBERT sentence embeddings. They also return the official resolution text in brackets within the metadata. Both models are augmented with synonyms, and previous augmentations are flexed according to their cosine distances to the unnormalized terms (ground truths).
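
As a rough illustration of distance-based resolution, the sketch below ranks candidate codes by cosine distance to a query vector. It is not the library's implementation: the three-dimensional vectors are invented for the example (real sentence embeddings have hundreds of dimensions), and cosine_distance and rank_candidates are hypothetical helper names. The codes and terms are taken from the example tables in this post.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rank_candidates(query_vec, candidates, top_k=5):
    """Sort (code, term, vector) candidates by distance to the query vector."""
    scored = [
        (cosine_distance(query_vec, vec), code, term)
        for code, term, vec in candidates
    ]
    scored.sort(key=lambda item: item[0])
    return scored[:top_k]


# Toy vectors, invented for illustration only (not real embeddings).
query = [0.9, 0.1, 0.3]
candidates = [
    ("C61", "prostate cancer", [0.1, 0.9, 0.4]),
    ("D090", "cancer in situ of urinary bladder", [0.7, 0.5, 0.1]),
    ("C679", "malignant tumor of urinary bladder", [0.8, 0.2, 0.3]),
]

ranked = rank_candidates(query, candidates, top_k=2)
for distance, code, term in ranked:
    print(f"{code}\t{distance:.4f}\t{term}")
```

The first entry of the ranked list plays the role of the code column in the result tables, and the full ranked list corresponds to all_codes and all_distances.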

sbiobertresolve_icd10cm_slim_billable_hcc: Trained with the classic sbiobert MLI embeddings (sbiobert_base_cased_mli).

Models Hub Page:
https://nlp.johnsnowlabs.com/2021/05/25/sbiobertresolve_icd10cm_slim_billable_hcc_en.html

sbertresolve_icd10cm_slim_billable_hcc_med: Trained with the new JSL SBERT embeddings (sbert_jsl_medium_uncased).

Models Hub Page:
https://nlp.johnsnowlabs.com/2021/05/25/sbertresolve_icd10cm_slim_billable_hcc_med_en.html

Example: ‘bladder cancer’

sbiobertresolve_icd10cm_augmented_billable_hcc

chunks	code	all_codes	resolutions	all_distances	100x Loop (sec)
bladder cancer	C679	[C679, Z126, D090, D494, C7911]	[bladder cancer, suspected bladder cancer, cancer in situ of urinary bladder, tumor of bladder neck, malignant tumour of bladder neck]	[0.0000, 0.0904, 0.0978, 0.1080, 0.1281]	26.9

sbiobertresolve_icd10cm_slim_billable_hcc

chunks	code	all_codes	resolutions	all_distances	100x Loop (sec)
bladder cancer	D090	[D090, D494, C7911, C680, C679]	[cancer in situ of urinary bladder [Carcinoma in situ of bladder], tumor of bladder neck [Neoplasm of unspecified behavior of bladder], malignant tumour of bladder neck [Secondary malignant neoplasm of bladder], carcinoma of urethra [Malignant neoplasm of urethra], malignant tumor of urinary bladder [Malignant neoplasm of bladder, unspecified]]	[0.0978, 0.1080, 0.1281, 0.1314, 0.1284]	20.9

sbertresolve_icd10cm_slim_billable_hcc_med

chunks	code	all_codes	resolutions	all_distances	100x Loop (sec)
bladder cancer	C671	[C671, C679, C61, C672, C673]	[bladder cancer, dome [Malignant neoplasm of dome of bladder], cancer of the urinary bladder [Malignant neoplasm of bladder, unspecified], prostate cancer [Malignant neoplasm of prostate], cancer of the urinary bladder]	[0.0894, 0.1051, 0.1184, 0.1180, 0.1200]	12.8
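
Downstream code often needs the matched synonym and the official description separately. Here is a minimal sketch, assuming each resolution has already been isolated as a single string in the "term [Official description]" shape shown in the slim resolver outputs above. The helper split_resolution is hypothetical and not part of Spark NLP.

```python
import re

# Matches "matched term [Official ICD-10-CM description]" for one resolution.
_RESOLUTION = re.compile(r"^(?P<term>.*?)\s*\[(?P<official>[^\]]*)\]$")


def split_resolution(text):
    """Return (matched_term, official_description); the latter is None if absent."""
    cleaned = text.strip()
    match = _RESOLUTION.match(cleaned)
    if match is None:
        # No bracketed description, e.g. output of the augmented model.
        return cleaned, None
    return match.group("term"), match.group("official")


# Individual resolution strings copied from the example tables.
examples = [
    "cancer in situ of urinary bladder [Carcinoma in situ of bladder]",
    "malignant tumor of urinary bladder [Malignant neoplasm of bladder, unspecified]",
    "bladder cancer",
]

parsed = [split_resolution(item) for item in examples]
for term, official in parsed:
    print(term, "->", official)
```

The sketch works on one resolution at a time on purpose: official descriptions such as "Malignant neoplasm of bladder, unspecified" contain commas, so splitting the whole resolutions list on commas would cut some entries in half.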

Veysel Kocaman

Our additional expert:

Veysel is the Chief Technology Officer at John Snow Labs, improving the Spark NLP for Healthcare library and delivering hands-on projects in Healthcare and Life Science. Holding a PhD degree in ML, Dr. Kocaman has authored more than 25 papers in peer-reviewed journals and conferences in the last few years, focusing on solving real-world problems in healthcare with NLP. He is a seasoned data scientist with over ten years of experience and a strong background in every aspect of data science, including machine learning, artificial intelligence, and big data. Veysel has provided consulting in Statistics, Data Science, Software Architecture, DevOps, Machine Learning, and AI to several start-ups, boot camps, and companies around the globe. He also speaks at Data Science & AI events, conferences, and workshops, and has delivered more than a hundred talks at international as well as national conferences and meetups.
