

High Accuracy Resolution of Medical Entities to Standard Codes Using Novel Sentence Embeddings

June 11, 2021

The release of Spark NLP for Healthcare 3.1 brings significant speed and accuracy improvements to entity resolution, also known as entity linking: the task of mapping a medical entity to a standard code. This release supports the SNOMED-CT, ICD-10-CM, ICD-10-PCS, CPT, LOINC, RxNorm, UMLS, HPO, and ICD-O terminologies. Several customers have already vetted the accuracy gains on real-world data.



New John Snow Labs SBert Sentence Embeddings 

One challenge with resolving medical entities to codes is that very often, multiple similar codes exist for a term, and the most appropriate one depends on context. For example, “bladder cancer” may be mapped to any of the following ICD-10-CM standard terms: 


  • Cancer in situ of urinary bladder
  • Carcinoma in situ of bladder
  • Tumor of bladder neck
  • Neoplasm of unspecified behavior of bladder
  • Malignant tumour of bladder neck
  • Secondary malignant neoplasm of bladder
  • Malignant tumor of urinary bladder
  • Malignant neoplasm of bladder, unspecified


Ranking these options, or picking the most relevant one, depends heavily on the context. To capture medical context better, we’ve developed a set of new healthcare-specific sentence embeddings, including the first medical sentence embeddings available in different sizes. These embeddings (and the entity resolution models that leverage them) are not available anywhere else, and they deliver better accuracy than older embeddings such as BioBERT and ClinicalBERT, for three reasons:

  • They’re based on a newer deep learning architecture (SBERT). 
  • We augmented the training data beyond what’s available in public academic datasets. 
  • They’re more current, because we just retrained them. By contrast, BioBERT was trained in 2019, before COVID-19 was ever mentioned on PubMed.


The new SBERT models delivered with the 3.1 release are fine-tuned on the MedNLI, NLI, and UMLS datasets with various parameters to cover common NLP tasks in the medical domain:

  • sbiobert_jsl_cased 
  • sbiobert_jsl_umls_cased 
  • sbert_jsl_medium_uncased 
  • sbert_jsl_medium_umls_uncased 
  • sbert_jsl_mini_uncased 
  • sbert_jsl_mini_umls_uncased 
  • sbert_jsl_tiny_uncased
  • sbert_jsl_tiny_umls_uncased 



6X Faster Load Times for Sentence Resolver Models 

Sentence resolver models now load about six times faster on average than in previous versions. The load process is also more memory friendly: peak memory during loading is lower, which reduces the chance of out-of-memory exceptions and relaxes hardware requirements.



John Snow Labs SBert Model Speed Benchmark 

Model                          Base Model             Is Cased  Train Datasets        Inference speed (100 rows)
sbiobert_jsl_cased             biobert_v1.1_pubmed    Cased     medNLI, allNLI        274.53
sbiobert_jsl_umls_cased        biobert_v1.1_pubmed    Cased     medNLI, allNLI, umls  274.52
sbert_jsl_medium_uncased       uncased_L-8_H-512_A-8  Uncased   medNLI, allNLI        80.40
sbert_jsl_medium_umls_uncased  uncased_L-8_H-512_A-8  Uncased   medNLI, allNLI, umls  78.35
sbert_jsl_mini_uncased         uncased_L-4_H-256_A-4  Uncased   medNLI, allNLI        10.68
sbert_jsl_mini_umls_uncased    uncased_L-4_H-256_A-4  Uncased   medNLI, allNLI, umls  10.29
sbert_jsl_tiny_uncased         uncased_L-2_H-128_A-2  Uncased   medNLI, allNLI        4.54
sbert_jsl_tiny_umls_uncased    uncased_L-2_H-128_A-2  Uncased   medNLI, allNLI, umls  4.54
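To make the size trade-off in the benchmark easier to read, the figures can be turned into ratios against the slowest model. The sketch below is illustrative only: it assumes the "Inference speed" column is an elapsed time where lower is faster (the table does not state a unit), it writes the decimal separator as a point, and the helper name `relative_to_slowest` is made up for this example.

```python
# Figures copied from the benchmark table; the unit is not stated in the
# source, so only unit-free ratios are computed here.
BENCHMARK = {
    "sbiobert_jsl_cased": 274.53,
    "sbiobert_jsl_umls_cased": 274.52,
    "sbert_jsl_medium_uncased": 80.40,
    "sbert_jsl_medium_umls_uncased": 78.35,
    "sbert_jsl_mini_uncased": 10.68,
    "sbert_jsl_mini_umls_uncased": 10.29,
    "sbert_jsl_tiny_uncased": 4.54,
    "sbert_jsl_tiny_umls_uncased": 4.54,
}


def relative_to_slowest(benchmark):
    """Return how many times faster each model is than the slowest one.

    Assumes the values are elapsed times (lower is faster).
    """
    slowest = max(benchmark.values())
    return {name: slowest / value for name, value in benchmark.items()}


speedups = relative_to_slowest(BENCHMARK)

# Print the models from slowest to fastest with their relative factor.
for name, factor in sorted(speedups.items(), key=lambda item: item[1]):
    print(f"{name:<32} {factor:6.1f}x")
```

Under that assumption, the smaller models trade some representational capacity for much lower latency, which matters most when resolving entities across large document collections.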



Higher Accuracy ICD-10-CM Resolver Models 

These models map clinical entities and concepts to ICD-10-CM codes using SBERT sentence embeddings. They also return the official resolution text, in brackets, inside the metadata. Both models are augmented with synonyms, and earlier augmentations have been adjusted according to their cosine distances to the unnormalized terms (ground truths).
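The idea of resolving a term by embedding distance can be sketched in a few lines of plain Python. This is a toy illustration, not the Spark NLP implementation: the three-dimensional vectors are invented for the example (real sentence embeddings have hundreds of dimensions), and the function names `cosine_distance` and `resolve` are hypothetical. Only the codes and term strings are taken from this article.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def resolve(query_vec, index, top_k=3):
    """Rank (code, term, vector) entries by distance to the query vector."""
    scored = [
        (code, term, cosine_distance(query_vec, vec))
        for code, term, vec in index
    ]
    return sorted(scored, key=lambda row: row[2])[:top_k]


# Toy index: the vectors are made up purely for illustration.
index = [
    ("C679", "malignant tumor of urinary bladder", [0.9, 0.1, 0.2]),
    ("D090", "cancer in situ of urinary bladder", [0.7, 0.6, 0.1]),
    ("C61", "prostate cancer", [0.1, 0.2, 0.9]),
]

# Hypothetical embedding of the chunk "bladder cancer".
query = [0.85, 0.15, 0.25]

results = resolve(query, index, top_k=2)
for code, term, distance in results:
    print(f"{code:<6} {distance:.4f}  {term}")
```

In this framing, the quality of the ranking depends entirely on how well the embedding places clinically equivalent phrases close together, which is why the choice of sentence embedding model matters so much for resolution accuracy.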

  • sbiobertresolve_icd10cm_slim_billable_hcc: Trained with the classic sbiobert MLI embeddings (sbiobert_base_cased_mli).

Models Hub Page:

  • sbertresolve_icd10cm_slim_billable_hcc_med: Trained with the new JSL SBERT embeddings (sbert_jsl_medium_uncased).

Models Hub Page:

Example: ‘bladder cancer’ 


  • sbiobertresolve_icd10cm_augmented_billable_hcc
    chunks: bladder cancer
    code: C679
    all_codes: [C679, Z126, D090, D494, C7911]
    resolutions: [bladder cancer, suspected bladder cancer, cancer in situ of urinary bladder, tumor of bladder neck, malignant tumour of bladder neck]
    all_distances: [0.0000, 0.0904, 0.0978, 0.1080, 0.1281]
    100x Loop (sec): 26.9

  • sbiobertresolve_icd10cm_slim_billable_hcc
    chunks: bladder cancer
    code: D090
    all_codes: [D090, D494, C7911, C680, C679]
    resolutions: [cancer in situ of urinary bladder [Carcinoma in situ of bladder], tumor of bladder neck [Neoplasm of unspecified behavior of bladder], malignant tumour of bladder neck [Secondary malignant neoplasm of bladder], carcinoma of urethra [Malignant neoplasm of urethra], malignant tumor of urinary bladder [Malignant neoplasm of bladder, unspecified]]
    all_distances: [0.0978, 0.1080, 0.1281, 0.1314, 0.1284]
    100x Loop (sec): 20.9

  • sbertresolve_icd10cm_slim_billable_hcc_med
    chunks: bladder cancer
    code: C671
    all_codes: [C671, C679, C61, C672, C673]
    resolutions: [bladder cancer, dome [Malignant neoplasm of dome of bladder], cancer of the urinary bladder [Malignant neoplasm of bladder, unspecified], prostate cancer [Malignant neoplasm of prostate], cancer of the urinary bladder]
    all_distances: [0.0894, 0.1051, 0.1184, 0.1180, 0.1200]
    100x Loop (sec): 12.8
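Downstream code usually needs the matched term and the bracketed official text as separate fields. Here is a minimal sketch of that post-processing, using the sbiobertresolve_icd10cm_slim_billable_hcc result for "bladder cancer" shown above. It assumes the candidates have already been split into parallel Python lists; how the library delimits them inside its metadata is not covered in this article, and the helper names `split_resolution` and `ranked_candidates` are hypothetical. The distances as listed are not strictly ascending in the last two positions, so the sketch sorts explicitly.

```python
import re

# Candidate lists copied from the slim model's "bladder cancer" example.
codes = ["D090", "D494", "C7911", "C680", "C679"]
resolutions = [
    "cancer in situ of urinary bladder [Carcinoma in situ of bladder]",
    "tumor of bladder neck [Neoplasm of unspecified behavior of bladder]",
    "malignant tumour of bladder neck [Secondary malignant neoplasm of bladder]",
    "carcinoma of urethra [Malignant neoplasm of urethra]",
    "malignant tumor of urinary bladder [Malignant neoplasm of bladder, unspecified]",
]
distances = [0.0978, 0.1080, 0.1281, 0.1314, 0.1284]

# "matched term [official text]"; the bracketed part is optional.
_RESOLUTION = re.compile(r"^(?P<term>.*?)\s*\[(?P<official>.*)\]\s*$")


def split_resolution(text):
    """Split 'term [official]' into (term, official); official may be None."""
    match = _RESOLUTION.match(text)
    if match is None:
        return text.strip(), None
    return match.group("term"), match.group("official")


def ranked_candidates(codes, resolutions, distances):
    """Combine the parallel lists into dicts sorted by ascending distance."""
    rows = []
    for code, resolution, distance in zip(codes, resolutions, distances):
        term, official = split_resolution(resolution)
        rows.append(
            {"code": code, "term": term, "official": official, "distance": distance}
        )
    return sorted(rows, key=lambda row: row["distance"])


ranked = ranked_candidates(codes, resolutions, distances)
for row in ranked:
    print(f'{row["code"]:<6} {row["distance"]:.4f}  {row["official"]}')
```

Keeping the official description next to each code makes it easier for a reviewer to check a suggested code without looking it up separately.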


