2600+ New Models for 200+ Languages and 10+ Dimension Reduction Algorithms for Streamlit Word-Embedding visualizations in 3-D with NLU

06.07.2021

Christian Kasim Loan

Senior Data Scientist at John Snow Labs

We are extremely excited to announce the release of NLU 3.1! This is our biggest release so far, and it comes with 2600+ new models in 200+ languages, including DistilBERT, RoBERTa, XLM-RoBERTa, and Hugging Face based embeddings from the incredible Spark NLP 3.1.0 release; new Streamlit visualizations for viewing Word Embeddings in 3-D, 2-D, and 1-D; new Healthcare pipelines for healthcare code mappings; and, finally, confidence extraction for open source NER models. Additionally, the NLU Namespace has been renamed to the NLU Spellbook, to reflect the magic of the 1-liners it contains!

Streamlit Word Embedding visualization via Manifold and Matrix Decomposition algorithms

function `pipe.viz_streamlit_word_embed_manifold`

Visualize Word Embeddings in 1-D, 2-D, or 3-D by Reducing Dimensionality via 11 Supported methods from Manifold Algorithms and Matrix Decomposition Algorithms. Additionally, you can color the lower dimensional points with a label that has been previously assigned to the text by specifying a list of nlu references in the additional_classifiers_for_coloring parameter.

Reduces Dimensionality of high dimensional Word Embeddings to 1-D, 2-D, or 3-D and plots the resulting data in an interactive Plotly plot
Applicable with any of the 100+ Word Embedding models
Color points by classifying with any of the 100+ Parts of Speech Classifiers or Document Classifiers
Generates NUM-DIMENSIONS * NUM-EMBEDDINGS * NUM-DIMENSION-REDUCTION-ALGOS plots
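To make the plot count concrete, here is a small sketch that enumerates one plot per (dimension, embedding, algorithm) combination. The helper name `planned_plots` is hypothetical and not part of NLU; it only illustrates the arithmetic of the last bullet.

```python
from itertools import product


def planned_plots(target_dimensions, embeddings, algos):
    """Enumerate one (dimension, embedding, algorithm) triple per plot.

    Hypothetical helper for illustration only; it is not part of NLU.
    """
    return list(product(target_dimensions, embeddings, algos))


plots = planned_plots(
    target_dimensions=[1, 2, 3],  # the default target dimensions
    embeddings=["bert"],          # one loaded embedding model
    algos=["TSNE", "PCA"],        # the default algorithms
)
print(len(plots))  # 3 dimensions * 1 embedding * 2 algorithms = 6
```

Adding a second embedding or more algorithms multiplies the number of plots, so it can be worth trimming `target_dimensions` when many algorithms are selected.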

import nlu
texts = ['You can visualize any of the 100+ embeddings', 'with 10+ dimension reduction algorithms',
         'and view the results in 3D, 2D, and 1D which can be colored by various classifier labels!']
nlu.load('bert').viz_streamlit_word_embed_manifold(default_texts=texts)

Dimension reduction techniques applied to BERT embeddings to view them in 1-D, 2-D and 3-D
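As a rough illustration of what a Matrix Decomposition method such as PCA does to embeddings, the following pure-Python sketch projects a few vectors onto their first principal component, giving one coordinate per word. This is not how NLU performs the reduction internally: the function `pca_1d` is hypothetical, and the 4-dimensional "embeddings" are made-up numbers rather than real model output.

```python
import math
import random


def pca_1d(vectors, iterations=200, seed=0):
    """Project vectors onto their first principal component.

    Illustrative sketch: power iteration on the covariance matrix.
    """
    n, d = len(vectors), len(vectors[0])
    # Center the data so the component captures variance, not the mean.
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Covariance matrix of shape d x d.
    cov = [[sum(row[a] * row[b] for row in centered) / n for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the dominant eigenvector of cov.
    rng = random.Random(seed)
    w = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    # 1-D coordinate: dot product of each centered vector with w.
    return [sum(row[j] * w[j] for j in range(d)) for row in centered]


# Toy 4-dimensional "embeddings" (made-up values for illustration).
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
    "bus": [0.0, 0.2, 0.8, 0.9],
}
coords = pca_1d(list(toy_embeddings.values()))
for word, x in zip(toy_embeddings, coords):
    print(f"{word}: {x:+.3f}")
```

In this toy data the two animal words land on one side of zero and the two vehicle words on the other, which is the kind of grouping the 1-D plots are meant to reveal.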

Function parameters `pipe.viz_streamlit_word_embed_manifold`

Argument	Type	Default	Description
`default_texts`	`List[str]`	`("Donald Trump likes to party!", "Angela Merkel likes to party!", 'Peter HATES TO PARTTY!!!! :(')`	List of strings to apply classifiers, embeddings, and manifolds to.
`text`	`Optional[str]`	`'Billy likes to swim'`	Text to predict classes for.
`sub_title`	`Optional[str]`	"Apply any of the 11 `Manifold` or `Matrix Decomposition` algorithms to reduce the dimensionality of `Word Embeddings` to `1-D`, `2-D` and `3-D`"	Subtitle of the Streamlit app
`default_algos_to_apply`	`List[str]`	`["TSNE", "PCA"]`	A list of Manifold and Matrix Decomposition Algorithms to apply. Each entry can be `'TSNE'`, `'ISOMAP'`, `'LLE'`, `'Spectral Embedding'`, `'MDS'`, `'PCA'`, `'SVD aka LSA'`, `'DictionaryLearning'`, `'FactorAnalysis'`, `'FastICA'`, or `'KernelPCA'`.
`target_dimensions`	`List[int]`	`(1,2,3)`	Defines the target dimension embeddings will be reduced to
`show_algo_select`	`bool`	`True`	Show selector for Manifold and Matrix Decomposition Algorithms
`show_embed_select`	`bool`	`True`	Show selector for Embedding Selection
`show_color_select`	`bool`	`True`	Show selector for coloring plots
`MAX_DISPLAY_NUM`	`int`	`100`	Cap maximum number of Tokens displayed
`display_embed_information`	`bool`	`True`	Show additional embedding information like `dimension`, `nlu_reference`, `spark_nlp_reference`, `storage_reference`, `modelhub link` and more.
`set_wide_layout_CSS`	`bool`	`True`	Whether to inject custom CSS or not.
`num_cols`	`int`	`2`	How many columns to use for the layout in Streamlit when rendering the similarity matrices.
`key`	`str`	`"NLU_streamlit"`	Key for the Streamlit elements drawn
`additional_classifiers_for_coloring`	`List[str]`	`['pos', 'sentiment.imdb']`	List of additional NLU references to load for generating hue colors
`show_model_select`	`bool`	`True`	Show a model selection dropdown that makes any of the 1000+ models available in 1 click
`model_select_position`	`str`	`'side'`	Where to render the model selector: `'side'` for the sidebar or `'main'` for the main page
`show_logo`	`bool`	`True`	Show logo
`display_infos`	`bool`	`False`	Display additional information about ISO codes and the NLU namespace structure.
`n_jobs`	`Optional[int]`	`3`	Number of parallel jobs to use when running the dimension reduction algorithms
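The parameters in the table can be combined into a single call. The sketch below assembles the keyword arguments and checks the algorithm names against the values the table lists. The helper `build_viz_kwargs` is hypothetical and not part of NLU, and the final call is left as a comment because it needs NLU, Spark, and a running Streamlit script.

```python
# Algorithm names as listed for `default_algos_to_apply` in the table.
SUPPORTED_ALGOS = {
    "TSNE", "ISOMAP", "LLE", "Spectral Embedding", "MDS", "PCA",
    "SVD aka LSA", "DictionaryLearning", "FactorAnalysis", "FastICA",
    "KernelPCA",
}


def build_viz_kwargs(texts, algos=("TSNE", "PCA"), dims=(1, 2, 3),
                     classifiers=("pos", "sentiment.imdb")):
    """Assemble keyword arguments for viz_streamlit_word_embed_manifold.

    Hypothetical convenience helper, not part of NLU. It rejects algorithm
    names and target dimensions that the table does not list.
    """
    unknown = [a for a in algos if a not in SUPPORTED_ALGOS]
    if unknown:
        raise ValueError(f"Unsupported algorithms: {unknown}")
    bad_dims = [d for d in dims if d not in (1, 2, 3)]
    if bad_dims:
        raise ValueError(f"Unsupported target dimensions: {bad_dims}")
    return {
        "default_texts": list(texts),
        "default_algos_to_apply": list(algos),
        "target_dimensions": list(dims),
        "additional_classifiers_for_coloring": list(classifiers),
    }


kwargs = build_viz_kwargs(
    ["NLU 3.1 visualizes word embeddings", "in one, two, or three dimensions"],
    algos=["PCA", "ISOMAP"],
    dims=[2, 3],
)
print(kwargs)

# Inside a Streamlit script with NLU installed, the call would then be:
# import nlu
# nlu.load('bert').viz_streamlit_word_embed_manifold(**kwargs)
```

Validating the names up front gives a clear error message for a misspelled algorithm before any model is downloaded.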

Larger Example showcasing more dimension reduction techniques on a larger corpus:

See the Matrix movie script in 3-D from the perspective of BERT or any other Transformer and Embedding!

Supported Manifold Algorithms

Supported Matrix Decomposition Algorithms

New Healthcare Pipelines

Five new healthcare code mapping pipelines:

`nlu.load('en.resolve.icd10cm.umls')`: This pretrained pipeline maps ICD10CM codes to UMLS codes without using any text data. You’ll just feed white space-delimited ICD10CM codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned.

{'icd10cm': ['M89.50', 'R82.2', 'R09.01'],'umls': ['C4721411', 'C0159076', 'C0004044']}

`nlu.load('en.resolve.mesh.umls')`: This pretrained pipeline maps MeSH codes to UMLS codes without using any text data. You’ll just feed white space-delimited MeSH codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned.

{'mesh': ['C028491', 'D019326', 'C579867'],'umls': ['C0970275', 'C0886627', 'C3696376']}

`nlu.load('en.resolve.rxnorm.umls')`: This pretrained pipeline maps RxNorm codes to UMLS codes without using any text data. You’ll just feed white space-delimited RxNorm codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned.

{'rxnorm': ['1161611', '315677', '343663'],'umls': ['C3215948', 'C0984912', 'C1146501']}

`nlu.load('en.resolve.rxnorm.mesh')`: This pretrained pipeline maps RxNorm codes to MeSH codes without using any text data. You’ll just feed white space-delimited RxNorm codes and it will return the corresponding MeSH codes as a list. If there is no mapping, the original code is returned.

{'rxnorm': ['1191', '6809', '47613'],'mesh': ['D001241', 'D008687', 'D019355']}

`nlu.load('en.resolve.snomed.umls')`: This pretrained pipeline maps SNOMED codes to UMLS codes without using any text data. You’ll just feed white space-delimited SNOMED codes and it will return the corresponding UMLS codes as a list. If there is no mapping, the original code is returned.

{'snomed': ['733187009', '449433008', '51264003'],'umls': ['C4546029', 'C3164619', 'C0271267']}
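The input and output contract shared by these five pipelines can be sketched with a toy lookup. The dictionary below is built only from the ICD10CM example output shown earlier, and the function `map_codes` is hypothetical; the real pipelines resolve codes with pretrained models, not a hand-written table. The pass-through of unmapped codes reflects our reading of "the original code is returned" and is an assumption.

```python
# Toy lookup copied from the ICD10CM example output; for illustration only.
TOY_ICD10CM_TO_UMLS = {
    "M89.50": "C4721411",
    "R82.2": "C0159076",
    "R09.01": "C0004044",
}


def map_codes(codes: str, table: dict) -> dict:
    """Map whitespace-delimited codes, passing unmapped codes through."""
    source = codes.split()  # split on any run of whitespace
    target = [table.get(code, code) for code in source]
    return {"icd10cm": source, "umls": target}


print(map_codes("M89.50 R82.2 R09.01", TOY_ICD10CM_TO_UMLS))
# "X00.0" is a made-up code used to show the assumed fallback behavior.
print(map_codes("M89.50 X00.0", TOY_ICD10CM_TO_UMLS))

# With a licensed healthcare installation, the real call would presumably be:
# import nlu
# nlu.load('en.resolve.icd10cm.umls').predict('M89.50 R82.2 R09.01')
```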

New Healthcare Pipelines

NLU Reference	Spark NLP Reference
en.resolve.icd10cm.umls	icd10cm_umls_mapping
en.resolve.mesh.umls	mesh_umls_mapping
en.resolve.rxnorm.umls	rxnorm_umls_mapping
en.resolve.rxnorm.mesh	rxnorm_mesh_mapping
en.resolve.snomed.umls	snomed_umls_mapping
en.explain_doc.carp	explain_clinical_doc_carp
en.explain_doc.era	explain_clinical_doc_era

New Open Source Models and Pipelines

nlu.load() Reference	Spark NLP Reference
en.embed.distilbert	distilbert_base_cased
en.embed.distilbert.base	distilbert_base_cased
en.embed.distilbert.base.uncased	distilbert_base_uncased
en.embed.distilroberta	distilroberta_base
en.embed.roberta	roberta_base
en.embed.roberta.base	roberta_base
en.embed.roberta.large	roberta_large
xx.marian	opus_mt_en_fr
xx.embed.distilbert	distilbert_base_multilingual_cased
xx.embed.xlm	xlm_roberta_base
xx.embed.xlm.base	xlm_roberta_base
xx.embed.xlm.twitter	twitter_xlm_roberta_base
zh.embed.bert	bert_base_chinese
zh.embed.bert.wwm	chinese_bert_wwm
de.embed.bert	bert_base_german_cased
de.embed.bert.uncased	bert_base_german_uncased
nl.embed.bert	bert_base_dutch_cased
it.embed.bert	bert_base_italian_cased
tr.embed.bert	bert_base_turkish_cased
tr.embed.bert.uncased	bert_base_turkish_uncased

1 line Install NLU on Google Colab

!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash

1 line Install NLU on Kaggle

!wget https://setup.johnsnowlabs.com/nlu/kaggle.sh -O - | bash

Install via PIP

! pip install nlu pyspark==3.0.3

Our additional expert:

Christian Kasim Loan is a computer scientist with over 10 years of coding experience who works for John Snow Labs as a Senior Data Scientist, where he helps port the latest and greatest Machine Learning models to Spark and created the NLU library.