
2600+ New Models for 200+ Languages and 10+ Dimension Reduction Algorithms for Streamlit Word-Embedding visualizations in 3-D with NLU

Senior Data Scientist at John Snow Labs

We are extremely excited to announce the release of NLU 3.1! This is our biggest release so far, and it comes with more than 2600 new models in 200+ languages, including DistilBERT, RoBERTa, XLM-RoBERTa, and Hugging Face based embeddings from the incredible Spark NLP 3.1.0 release; new Streamlit visualizations for viewing word embeddings in 3-D, 2-D, and 1-D; new healthcare pipelines for healthcare code mappings; and, finally, confidence extraction for open source NER models. Additionally, the NLU Namespace has been renamed to the NLU Spellbook, to reflect the magic of the 1-liners it contains!

Streamlit Word Embedding visualization via Manifold and Matrix Decomposition algorithms

function pipe.viz_streamlit_word_embed_manifold

Visualize Word Embeddings in 1-D, 2-D, or 3-D by Reducing Dimensionality via 11 Supported methods from Manifold Algorithms and Matrix Decomposition Algorithms. Additionally, you can color the lower dimensional points with a label that has been previously assigned to the text by specifying a list of nlu references in the additional_classifiers_for_coloring parameter.

import nlu
texts = ['You can visualize any of the 100+ embeddings', 'with 10+ dimension reduction algorithms', 'and view the results in 3D, 2D, and 1D, which can be colored by various classifier labels!']
nlu.load('bert').viz_streamlit_word_embed_manifold(default_texts=texts)

Dimension reduction techniques applied to BERT embeddings to view them in 1-D, 2-D and 3-D
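To make "reducing embeddings to 1-D" more concrete, here is a minimal pure-Python sketch of PCA restricted to the first principal component, computed with power iteration. It is only an illustration of the idea, not how NLU computes its plots; the function name `pca_1d` and the 4-dimensional toy vectors are assumptions made up for this example (real word embeddings have hundreds of dimensions).

```python
import math
import random


def pca_1d(vectors, iterations=200, seed=0):
    """Project vectors onto their first principal component (illustrative sketch)."""
    n, dim = len(vectors), len(vectors[0])
    # Center the data: PCA works on deviations from the mean vector.
    mean = [sum(v[j] for v in vectors) / n for j in range(dim)]
    centered = [[v[j] - mean[j] for j in range(dim)] for v in vectors]
    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # which is the direction of greatest variance.
    rng = random.Random(seed)
    axis = [rng.random() + 0.1 for _ in range(dim)]
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * axis[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in nxt))
        axis = [x / norm for x in nxt]
    # The 1-D coordinate of each vector is its dot product with that axis.
    return [sum(row[j] * axis[j] for j in range(dim)) for row in centered]


# Hypothetical toy "embeddings": two animals and two vehicles.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.0, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
    "bus": [0.0, 0.1, 0.8, 0.9],
}
coords = dict(zip(toy_embeddings, pca_1d(list(toy_embeddings.values()))))
for word, x in coords.items():
    print(f"{word}: {x:+.3f}")
```

On this toy data the two animals land on one side of zero and the two vehicles on the other, which is the kind of grouping the 1-D, 2-D, and 3-D plots make visible. The sign of the axis is arbitrary, so only the relative positions matter.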

Function parameters pipe.viz_streamlit_word_embed_manifold

| Argument | Type | Default | Description |
|---|---|---|---|
| default_texts | List[str] | ("Donald Trump likes to party!", "Angela Merkel likes to party!", "Peter HATES TO PARTTY!!!! :(") | List of strings to apply classifiers, embeddings, and manifolds to. |
| text | Optional[str] | 'Billy likes to swim' | Text to predict classes for. |
| sub_title | Optional[str] | "Apply any of the 11 Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Word Embeddings to 1-D, 2-D and 3-D" | Subtitle of the Streamlit app. |
| default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of Manifold and Matrix Decomposition algorithms to apply. Each entry can be 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA', or 'KernelPCA'. |
| target_dimensions | List[int] | (1,2,3) | Target dimensions the embeddings will be reduced to. |
| show_algo_select | bool | True | Show selector for Manifold and Matrix Decomposition algorithms. |
| show_embed_select | bool | True | Show selector for embedding selection. |
| show_color_select | bool | True | Show selector for coloring plots. |
| MAX_DISPLAY_NUM | int | 100 | Cap on the maximum number of tokens displayed. |
| display_embed_information | bool | True | Show additional embedding information like dimension, nlu_reference, spark_nlp_reference, storage_reference, modelhub link and more. |
| set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not. |
| num_cols | int | 2 | How many columns to use for the Streamlit layout when rendering the similarity matrices. |
| key | str | "NLU_streamlit" | Key for the Streamlit elements drawn. |
| additional_classifiers_for_coloring | List[str] | ['pos', 'sentiment.imdb'] | List of additional NLU references to load for generating hue colors. |
| show_model_select | bool | True | Show a model selection dropdown that makes any of the 1000+ models available in 1 click. |
| model_select_position | str | 'side' | Position of the model selection dropdown; 'side' places it in the sidebar. |
| show_logo | bool | True | Show logo. |
| display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure. |
| n_jobs | Optional[int] | 3 | Number of parallel jobs to run. |
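As a rough illustration of how these parameters fit together, the sketch below checks a few arguments against the values documented in the table before passing them on. The helper `build_manifold_viz_kwargs` and the constants are hypothetical names invented for this example and are not part of NLU; only the keyword names and the algorithm names are taken from the table.

```python
# The 11 algorithm names listed in the parameter table.
SUPPORTED_ALGOS = {
    "TSNE", "ISOMAP", "LLE", "Spectral Embedding", "MDS",
    "PCA", "SVD aka LSA", "DictionaryLearning", "FactorAnalysis",
    "FastICA", "KernelPCA",
}
ALLOWED_DIMENSIONS = {1, 2, 3}


def build_manifold_viz_kwargs(texts, algos=("TSNE", "PCA"), dims=(1, 2, 3), max_tokens=100):
    """Hypothetical helper: validate arguments and return them as keyword arguments."""
    unknown = [a for a in algos if a not in SUPPORTED_ALGOS]
    if unknown:
        raise ValueError(f"Unsupported algorithms: {unknown}")
    bad_dims = [d for d in dims if d not in ALLOWED_DIMENSIONS]
    if bad_dims:
        raise ValueError(f"Target dimensions must be 1, 2 or 3, got: {bad_dims}")
    return {
        "default_texts": list(texts),
        "default_algos_to_apply": list(algos),
        "target_dimensions": list(dims),
        "MAX_DISPLAY_NUM": max_tokens,
    }


viz_kwargs = build_manifold_viz_kwargs(
    texts=["Billy likes to swim"], algos=["PCA", "KernelPCA"], dims=[2, 3]
)
print(viz_kwargs)

# With nlu and streamlit installed, inside a Streamlit script:
# import nlu
# nlu.load('bert').viz_streamlit_word_embed_manifold(**viz_kwargs)
```

Validating up front like this is optional; it simply surfaces a misspelled algorithm name before the Streamlit app starts loading models.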

Larger Example showcasing more dimension reduction techniques on a larger corpus:

See the Matrix movie script in 3-D from the perspective of BERT or any other Transformer and Embedding!

Supported Manifold Algorithms

  • TSNE
  • ISOMAP
  • LLE
  • Spectral Embedding
  • MDS

Supported Matrix Decomposition Algorithms

  • PCA
  • SVD aka LSA
  • DictionaryLearning
  • FactorAnalysis
  • FastICA
  • KernelPCA

New Healthcare Pipelines

Five new healthcare code mapping pipelines:

  • nlu.load('en.resolve.icd10cm.umls'): This pretrained pipeline maps ICD10CM codes to UMLS codes without using any text data. Feed it whitespace-delimited ICD10CM codes and it returns the corresponding UMLS codes as a list. If a code has no mapping, the original code is returned instead.

{'icd10cm': ['M89.50', 'R82.2', 'R09.01'],'umls': ['C4721411', 'C0159076', 'C0004044']}

  • nlu.load('en.resolve.mesh.umls'): This pretrained pipeline maps MeSH codes to UMLS codes without using any text data. Feed it whitespace-delimited MeSH codes and it returns the corresponding UMLS codes as a list. If a code has no mapping, the original code is returned instead.

{'mesh': ['C028491', 'D019326', 'C579867'],'umls': ['C0970275', 'C0886627', 'C3696376']}

  • nlu.load('en.resolve.rxnorm.umls'): This pretrained pipeline maps RxNorm codes to UMLS codes without using any text data. Feed it whitespace-delimited RxNorm codes and it returns the corresponding UMLS codes as a list. If a code has no mapping, the original code is returned instead.

{'rxnorm': ['1161611', '315677', '343663'],'umls': ['C3215948', 'C0984912', 'C1146501']}

  • nlu.load('en.resolve.rxnorm.mesh'): This pretrained pipeline maps RxNorm codes to MeSH codes without using any text data. Feed it whitespace-delimited RxNorm codes and it returns the corresponding MeSH codes as a list. If a code has no mapping, the original code is returned instead.

{'rxnorm': ['1191', '6809', '47613'],'mesh': ['D001241', 'D008687', 'D019355']}

  • nlu.load('en.resolve.snomed.umls'): This pretrained pipeline maps SNOMED codes to UMLS codes without using any text data. Feed it whitespace-delimited SNOMED codes and it returns the corresponding UMLS codes as a list. If a code has no mapping, the original code is returned instead.

{'snomed': ['733187009', '449433008', '51264003'],'umls': ['C4546029', 'C3164619', 'C0271267']}
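The result dictionaries shown above can be turned into a direct lookup with a few lines of plain Python. This is a minimal sketch that works on a copy of the ICD10CM sample output rather than calling the pipeline; the helpers `to_pipeline_input` and `pair_codes` are hypothetical names, and the sketch assumes the two lists are aligned position by position, which the sample outputs suggest but the text does not state explicitly.

```python
def to_pipeline_input(codes):
    """Join codes into the whitespace-delimited string the mapping pipelines expect."""
    return " ".join(codes)


def pair_codes(result, source_key, target_key):
    """Zip source and target code lists into a source -> target dict.

    Assumes both lists are aligned by position (an assumption, see lead-in).
    """
    source, target = result[source_key], result[target_key]
    if len(source) != len(target):
        raise ValueError("source and target lists differ in length")
    return dict(zip(source, target))


# Copy of the sample output shown for en.resolve.icd10cm.umls.
sample_result = {
    "icd10cm": ["M89.50", "R82.2", "R09.01"],
    "umls": ["C4721411", "C0159076", "C0004044"],
}

query = to_pipeline_input(sample_result["icd10cm"])
icd10cm_to_umls = pair_codes(sample_result, "icd10cm", "umls")

# A code without a mapping is returned unchanged, so source == target flags it.
unmapped = [src for src, tgt in icd10cm_to_umls.items() if src == tgt]

print(query)
print(icd10cm_to_umls)
print(unmapped)
```

The same two helpers apply to the other four pipelines by swapping the keys, for example `pair_codes(result, "rxnorm", "mesh")`.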

New Healthcare Pipelines

| NLU Reference | Spark NLP Reference |
|---|---|
| en.resolve.icd10cm.umls | icd10cm_umls_mapping |
| en.resolve.mesh.umls | mesh_umls_mapping |
| en.resolve.rxnorm.umls | rxnorm_umls_mapping |
| en.resolve.rxnorm.mesh | rxnorm_mesh_mapping |
| en.resolve.snomed.umls | snomed_umls_mapping |
| en.explain_doc.carp | explain_clinical_doc_carp |
| en.explain_doc.era | explain_clinical_doc_era |

New Open Source Models and Pipelines

| nlu.load() Reference | Spark NLP Reference |
|---|---|
| en.embed.distilbert | distilbert_base_cased |
| en.embed.distilbert.base | distilbert_base_cased |
| en.embed.distilbert.base.uncased | distilbert_base_uncased |
| en.embed.distilroberta | distilroberta_base |
| en.embed.roberta | roberta_base |
| en.embed.roberta.base | roberta_base |
| en.embed.roberta.large | roberta_large |
| xx.marian | opus_mt_en_fr |
| xx.embed.distilbert. | distilbert_base_multilingual_cased |
| xx.embed.xlm | xlm_roberta_base |
| xx.embed.xlm.base | xlm_roberta_base |
| xx.embed.xlm.twitter | twitter_xlm_roberta_base |
| zh.embed.bert | bert_base_chinese |
| zh.embed.bert.wwm | chinese_bert_wwm |
| de.embed.bert | bert_base_german_cased |
| de.embed.bert.uncased | bert_base_german_uncased |
| nl.embed.bert | bert_base_dutch_cased |
| it.embed.bert | bert_base_italian_cased |
| tr.embed.bert | bert_base_turkish_cased |
| tr.embed.bert.uncased | bert_base_turkish_uncased |
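Several rows in this table point at the same Spark NLP model, meaning the shorter reference acts as an alias for the base variant. The sketch below groups a hand-copied subset of the rows by Spark NLP reference to make those aliases visible; the dictionary `NEW_MODELS` and the function `aliases_by_model` are hypothetical names for this illustration, not NLU objects.

```python
from collections import defaultdict

# Subset of the table above, copied by hand: NLU reference -> Spark NLP reference.
NEW_MODELS = {
    "en.embed.distilbert": "distilbert_base_cased",
    "en.embed.distilbert.base": "distilbert_base_cased",
    "en.embed.distilbert.base.uncased": "distilbert_base_uncased",
    "en.embed.roberta": "roberta_base",
    "en.embed.roberta.base": "roberta_base",
    "en.embed.roberta.large": "roberta_large",
    "xx.embed.xlm": "xlm_roberta_base",
    "xx.embed.xlm.base": "xlm_roberta_base",
    "xx.embed.xlm.twitter": "twitter_xlm_roberta_base",
}


def aliases_by_model(table):
    """Return only the Spark NLP models reachable through more than one NLU reference."""
    groups = defaultdict(list)
    for nlu_ref, spark_ref in table.items():
        groups[spark_ref].append(nlu_ref)
    return {model: refs for model, refs in groups.items() if len(refs) > 1}


aliases = aliases_by_model(NEW_MODELS)
for model, refs in aliases.items():
    print(model, "<-", refs)
```

In practice this means that, per the table, the short form such as 'en.embed.roberta' and the explicit 'en.embed.roberta.base' resolve to the same model.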

More

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via PIP

! pip install nlu pyspark==3.0.3

Christian Kasim Loan is a computer scientist with over 10 years of coding experience who works for John Snow Labs as a Senior Data Scientist, where he helps port the latest and greatest machine learning models to Spark and created the NLU library.
