100+ Transformer Models in 40+ languages, Streamlit Entity Manifold visualizations, Trainable Sentence Resolvers, Memory Optimization, and much more in NLU

We are extremely excited to announce the release of NLU 3.2, which marks the 1-year anniversary of the birth of this magical library.

This release brings features and improvements to every part of NLU, including 89 new NLP models with new Longformer, TokenBert, and TokenDistilBert models and multilingual NER for 40+ languages.

It also adds 12 new healthcare models along with trainable sentence resolvers: models for adverse drug relations, clinical TokenBert models, NER models for radiology, drugs, posology, administration cycles, and RxNorm, and new medical assertion models.

New Streamlit visualizations let you view the entities detected by Named Entity Recognizer models, via their embeddings, as 1-D, 2-D, and 3-D manifolds.

Finally, leveraging PyArrow reduces memory consumption in NLU's core by about 7%, which benefits every computation.

We are incredibly thankful to our community, which helped us come this far, and are looking forward to another magical year of NLU!

Streamlit Entity Manifold visualization

function pipe.viz_streamlit_entity_embed_manifold

Visualize the entities recognized by NER models via their entity embeddings in 1-D, 2-D, or 3-D by reducing dimensionality with any of the 10+ supported manifold and matrix decomposition algorithms.

You can pick additional NER models and compare them via the GUI dropdown on the left.

  • Reduces the dimensionality of high-dimensional entity embeddings to 1-D, 2-D, or 3-D and plots the resulting data in an interactive Plotly plot
  • Applicable to any of the 330+ Named Entity Recognizer models
nlu.load('ner').viz_streamlit_entity_embed_manifold(['Hello From John Snow Labs', 'Peter loves to visit New York'])

or just run

streamlit run <script path or URL>
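To make the idea of reducing embeddings to a lower dimension concrete, here is a minimal sketch that projects toy 3-D vectors onto their first principal component using only the Python standard library. It illustrates the general PCA technique, not NLU's internal implementation; the vectors and the helper names `principal_axis` and `reduce_to_1d` are made up for this example.

```python
import math

def principal_axis(vectors, iterations=100):
    """Return (mean, unit direction of largest variance) for equal-length vectors."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    centered = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue,
    # provided the start vector is not orthogonal to it.
    direction = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * direction[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variance along the current direction: stop
            break
        direction = [x / norm for x in nxt]
    return mean, direction

def reduce_to_1d(vectors):
    """Project every vector onto the first principal component."""
    mean, axis = principal_axis(vectors)
    return [sum((v[j] - mean[j]) * axis[j] for j in range(len(axis)))
            for v in vectors]

# Toy "embeddings" that all lie on one line through 3-D space.
toy_embeddings = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0], [4.0, 8.0, 12.0]]
coords = reduce_to_1d(toy_embeddings)
print(coords)
```

Because the toy points are collinear, a single coordinate preserves their spacing completely. Real entity embeddings have hundreds of dimensions, so the 1-D, 2-D, and 3-D views are approximations.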

function parameters pipe.viz_streamlit_entity_embed_manifold

| Argument | Type | Default | Description |
|---|---|---|---|
| default_texts | List[str] | ("Donald Trump likes to visit New York", "Angela Merkel likes to visit Berlin!", 'Peter hates visiting Paris') | List of strings to apply classifiers, embeddings, and manifolds to. |
| title | str | 'NLU ❤️ Streamlit - Prototype your NLP startup in 0 lines of code🚀' | Title of the Streamlit app |
| sub_title | Optional[str] | "Apply any of the 10+ Manifold or Matrix Decomposition algorithms to reduce the dimensionality of Entity Embeddings to 1-D, 2-D and 3-D" | Sub title of the Streamlit app |
| default_algos_to_apply | List[str] | ["TSNE", "PCA"] | A list of Manifold and Matrix Decomposition Algorithms to apply. Can be any of 'TSNE', 'ISOMAP', 'LLE', 'Spectral Embedding', 'MDS', 'PCA', 'SVD aka LSA', 'DictionaryLearning', 'FactorAnalysis', 'FastICA', or 'KernelPCA'. |
| target_dimensions | List[int] | (1,2,3) | Defines the target dimensions the embeddings will be reduced to |
| show_algo_select | bool | True | Show selector for Manifold and Matrix Decomposition Algorithms |
| set_wide_layout_CSS | bool | True | Whether to inject custom CSS or not. |
| num_cols | int | 2 | How many columns to use for the Streamlit layout when rendering the similarity matrices. |
| key | str | "NLU_streamlit" | Key for the Streamlit elements drawn |
| show_logo | bool | True | Show logo |
| display_infos | bool | False | Display additional information about ISO codes and the NLU namespace structure. |
| n_jobs | Optional[int] | 3 | False |
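As a rough sketch of how these parameters might be combined, the snippet below validates the algorithm names and target dimensions against the values listed in the table before passing them on. The helpers `build_manifold_kwargs` and `launch` are hypothetical and not part of NLU. The method name is taken from the section heading, the keyword names from the table, and the real call is imported lazily because it needs nlu, pyspark, and streamlit installed.

```python
# Algorithm names as listed in the parameter table.
SUPPORTED_ALGOS = {
    "TSNE", "ISOMAP", "LLE", "Spectral Embedding", "MDS", "PCA",
    "SVD aka LSA", "DictionaryLearning", "FactorAnalysis", "FastICA", "KernelPCA",
}

def build_manifold_kwargs(algos=("TSNE", "PCA"), target_dimensions=(1, 2, 3), **extra):
    """Hypothetical helper: check the list arguments and return keyword arguments."""
    unknown = [a for a in algos if a not in SUPPORTED_ALGOS]
    if unknown:
        raise ValueError(f"Unsupported algorithm(s): {unknown}")
    bad_dims = [d for d in target_dimensions if d not in (1, 2, 3)]
    if bad_dims:
        raise ValueError(f"Target dimensions must be 1, 2 or 3, got: {bad_dims}")
    return {
        "default_algos_to_apply": list(algos),
        "target_dimensions": list(target_dimensions),
        **extra,  # e.g. show_logo=False, num_cols=3
    }

def launch(texts, **options):
    """Hypothetical wrapper; run it from a script started with `streamlit run`."""
    import nlu  # lazy import: requires nlu, pyspark and streamlit
    pipe = nlu.load("ner")
    # Assumption: the method accepts the keyword names shown in the table.
    pipe.viz_streamlit_entity_embed_manifold(
        default_texts=texts, **build_manifold_kwargs(**options)
    )

kwargs = build_manifold_kwargs(
    algos=["PCA", "ISOMAP"], target_dimensions=[2, 3], show_logo=False
)
print(kwargs)
```

Validating up front means a misspelled algorithm name fails immediately instead of after the models have been downloaded and loaded.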

Sentence Entity Resolver Training

Sentence Entity Resolver Training Tutorial Notebook

Named entities are sub-pieces of textual data that are labeled with classes.

These classes and strings are still ambiguous, though, and it is not possible to group semantically identical entities without a defined terminology.

With the Sentence Resolver you can train a state-of-the-art deep learning architecture to map entities to their unique terminological representation.

Train a sentence resolver on a dataset with columns named y, _y, and text. y is the label, _y is an extra identifier label, and text is the raw text:

    import pandas as pd
    import nlu

    # Training data: raw text, a label (y), and an extra identifier label (_y)
    dataset = pd.DataFrame({
        'text': ['The Tesla company is good to invest is', 'TSLA is good to invest','TESLA INC. we should buy','PUT ALL MONEY IN TSLA inc!!'],
        'y': ['23','23','23','23'],
        '_y': ['TESLA','TESLA','TESLA','TESLA'],
    })

    # Load the trainable sentence resolver and fit it on the dataset
    trainable_pipe = nlu.load('train.resolve_sentence')
    fitted_pipe = trainable_pipe.fit(dataset)

    # Predict on the training data, then on unseen sentences
    res = fitted_pipe.predict(dataset)
    fitted_pipe.predict(["Peter told me to buy Tesla ", 'I have money to loose, is TSLA a good option?'])

| | sentence_resolution_resolve_sentence_confidence | sentence_resolution_resolve_sentence_code | sentence_resolution_resolve_sentence | sentence |
|---|---|---|---|---|
| 0 | '1.0000' | '23' | 'TESLA' | 'The Tesla company is good to invest is' |
| 1 | '1.0000' | '23' | 'TESLA' | 'TSLA is good to invest' |
| 2 | '1.0000' | '23' | 'TESLA' | 'TESLA INC. we should buy' |
| 3 | '1.0000' | '23' | 'TESLA' | 'PUT ALL MONEY IN TSLA inc!!' |

Alternatively, you can also use non-default healthcare embeddings.

trainable_pipe = nlu.load('en.embed.glove.biovec train.resolve_sentence')
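To illustrate what mapping text to a terminology code means, here is a deliberately simple stand-in that resolves a sentence to the code of the most similar training row by token overlap (Jaccard similarity). It is not the deep learning architecture the sentence resolver trains. The helper names and the second class (code '42', 'APPLE') are hypothetical, added only so the toy has something to tell apart.

```python
import re

def tokens(text):
    """Lowercased set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def resolve(mention, training_rows):
    """Return (code, label, score) of the training row most similar to `mention`."""
    query = tokens(mention)
    best = max(training_rows, key=lambda row: jaccard(query, tokens(row["text"])))
    return best["y"], best["_y"], jaccard(query, tokens(best["text"]))

training_rows = [
    {"text": "TSLA is good to invest", "y": "23", "_y": "TESLA"},
    {"text": "TESLA INC. we should buy", "y": "23", "_y": "TESLA"},
    # Hypothetical second class, not part of the dataset in this post.
    {"text": "AAPL is a safe bet", "y": "42", "_y": "APPLE"},
]

code, label, score = resolve("Peter told me to buy Tesla", training_rows)
print(code, label, round(score, 3))
```

Token overlap breaks down as soon as the wording differs from the training text, which is the gap that embedding-based resolvers are meant to close.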

Transformer Models

New models from the spectacular Spark NLP 3.2.0+ releases are integrated: 89 new models in total, including new Longformer, TokenBert, and TokenDistilBert models and multilingual NER for 40+ languages.

The supported languages, given by their ISO 639-1 codes, are: af, ar, bg, bn, de, el, en, es, et, eu, fa, fi, fr, he, hi, hu, id, it, ja, jv, ka, kk, ko, ml, mr, ms, my, nl, pt, ru, sw, ta, te, th, tl, tr, ur, vi, yo, and zh.

| nlu.load() Reference | Spark NLP Reference | Annotator Class | Language |
|---|---|---|---|
| en.embed.longformer | longformer_base_4096 | LongformerEmbeddings | en |
| en.embed.longformer.large | longformer_large_4096 | LongformerEmbeddings | en |
| en.ner.ontonotes_roberta_base | ner_ontonotes_roberta_base | NerDLModel | en |
| en.ner.ontonotes_roberta_large | ner_ontonotes_roberta_large | NerDLModel | en |
| en.ner.ontonotes_distilbert_base_cased | ner_ontonotes_distilbert_base_cased | NerDLModel | en |
| en.ner.conll_bert_base_cased | ner_conll_bert_base_cased | NerDLModel | en |
| en.ner.conll_distilbert_base_cased | ner_conll_distilbert_base_cased | NerDLModel | en |
| en.ner.conll_roberta_base | ner_conll_roberta_base | NerDLModel | en |
| en.ner.conll_roberta_large | ner_conll_roberta_large | NerDLModel | en |
| en.ner.conll_xlm_roberta_base | ner_conll_xlm_roberta_base | NerDLModel | en |
| en.ner.conll_longformer_large_4096 | ner_conll_longformer_large_4096 | NerDLModel | en |
| en.embed.token_bert.conll03 | bert_base_token_classifier_conll03 | NerDLModel | en |
| en.embed.token_bert.large_conll03 | bert_large_token_classifier_conll03 | NerDLModel | en |
| en.embed.token_bert.ontonote | bert_base_token_classifier_ontonote | NerDLModel | en |
| en.embed.token_bert.large_ontonote | bert_large_token_classifier_ontonote | NerDLModel | en |
| en.embed.token_bert.few_nerd | bert_base_token_classifier_few_nerd | NerDLModel | en |
| fa.embed.token_bert.parsbert_armanner | bert_token_classifier_parsbert_armanner | NerDLModel | fa |
| fa.embed.token_bert.parsbert_ner | bert_token_classifier_parsbert_ner | NerDLModel | fa |
| fa.embed.token_bert.parsbert_peymaner | bert_token_classifier_parsbert_peymaner | NerDLModel | fa |
| tr.embed.token_bert.turkish_ner | bert_token_classifier_turkish_ner | NerDLModel | tr |
| es.embed.token_bert.spanish_ner | bert_token_classifier_spanish_ner | NerDLModel | es |
| sv.embed.token_bert.swedish_ner | bert_token_classifier_swedish_ner | NerDLModel | sv |
| en.ner.fewnerd | nerdl_fewnerd_100d | NerDLModel | en |
| en.ner.fewnerd_subentity | nerdl_fewnerd_subentity_100d | NerDLModel | en |
| | ner_mit_movie_complex_bert_base_cased | NerDLModel | en |
| en.ner.movie_complex | ner_mit_movie_complex_bert_base_cased | NerDLModel | en |
| en.ner.movie_simple | ner_mit_movie_complex_bert_base_cased | NerDLModel | en |
| en.ner.mit_movie_complex_bert | ner_mit_movie_complex_bert_base_cased | NerDLModel | en |
| en.ner.mit_movie_complex_distilbert | ner_mit_movie_complex_distilbert_base_cased | NerDLModel | en |
| en.ner.mit_movie_simple | ner_mit_movie_simple_distilbert_base_cased | NerDLModel | en |
| en.embed_sentence.bert_use_cmlm_en_base | sent_bert_use_cmlm_en_base | BertSentenceEmbeddings | en |
| en.embed_sentence.bert_use_cmlm_en_large | sent_bert_use_cmlm_en_large | BertSentenceEmbeddings | en |
| xx.ner.xtreme_glove_840B_300 | ner_xtreme_glove_840B_300 | NerDLModel | xx |
| xx.ner.xtreme_xlm_roberta_xtreme_base | ner_xtreme_xlm_roberta_xtreme_base | NerDLModel | xx |
| xx.ner.wikiner_glove_840B_300 | ner_wikiner_glove_840B_300 | NerDLModel | xx |
| xx.ner.wikiner_xlm_roberta_base | ner_wikiner_xlm_roberta_base | NerDLModel | xx |
| xx.embed_sentence.bert_use_cmlm_multi_base_br | sent_bert_use_cmlm_multi_base_br | BertSentenceEmbeddings | xx |
| xx.embed_sentence.bert_use_cmlm_multi_base | sent_bert_use_cmlm_multi_base | BertSentenceEmbeddings | xx |
| xx.embed.xlm_roberta_xtreme_base | xlm_roberta_xtreme_base | XlmRoBertaEmbeddings | xx |
| xx.embed.bert_base_multilingual_cased | bert_base_multilingual_cased | Embeddings | xx |
| xx.embed.bert_base_multilingual_uncased | bert_base_multilingual_uncased | Embeddings | xx |
| | opus_tatoeba_af_ru | Translation | xx |
| | opus_tatoeba_he_fr | Translation | xx |
| | opus_tatoeba_it_he | Translation | xx |
| | opus_mt_cs_sv | Translation | xx |
| tr.classify.cyberbullying | classifierdl_berturk_cyberbullying | Pipelines | tr |
| zh.embed.xlnet | chinese_xlnet_base | Embeddings | zh |
| | classifierdl_bert_news | Pipelines | de |
| tr.classify.berturk_cyberbullying | classifierdl_berturk_cyberbullying_pipeline | Pipelines | tr |
| de.classify.bert_news | classifierdl_bert_news_pipeline | Pipelines | de |
| en.classify.electra_questionpair | classifierdl_electra_questionpair_pipeline | Pipelines | en |
| tr.classify.bert_news | classifierdl_bert_news_pipeline | Pipelines | tr |
| en.ner.conll_elmo | ner_conll_elmo | NerDLModel | en |
| en.ner.conll_albert_base_uncased | ner_conll_albert_base_uncased | NerDLModel | en |
| en.ner.conll_albert_large_uncased | ner_conll_albert_large_uncased | NerDLModel | en |
| en.ner.conll_xlnet_base_cased | ner_conll_xlnet_base_cased | NerDLModel | en |
| xx.embed.bert.muril | bert_muril | BertEmbeddings | xx |
| en.embed.bert.wiki_books_sst2 | bert_wiki_books_sst2 | BertEmbeddings | en |
| en.embed.bert.wiki_books_squad2 | bert_wiki_books_squad2 | BertEmbeddings | en |
| en.embed.bert.wiki_books_qqp | bert_wiki_books_qqp | BertEmbeddings | en |
| en.embed.bert.wiki_books_qnli | bert_wiki_books_qnli | BertEmbeddings | en |
| en.embed.bert.wiki_books_mnli | bert_wiki_books_mnli | BertEmbeddings | en |
| en.embed.bert.wiki_books | bert_wiki_books | BertEmbeddings | en |
| en.embed.bert.pubmed_squad2 | bert_pubmed_squad2 | BertEmbeddings | en |
| en.embed.bert.pubmed | bert_pubmed | BertEmbeddings | en |
| en.embed_sentence.bert.wiki_books_sst2 | sent_bert_wiki_books_sst2 | BertSentenceEmbeddings | en |
| en.embed_sentence.bert.wiki_books_squad2 | sent_bert_wiki_books_squad2 | BertSentenceEmbeddings | en |
| en.embed_sentence.bert.wiki_books_qqp | sent_bert_wiki_books_qqp | BertSentenceEmbeddings | en |
| en.embed_sentence.bert.wiki_books_qnli | sent_bert_wiki_books_qnli | BertSentenceEmbeddings | en |
| en.embed_sentence.bert.wiki_books_mnli | sent_bert_wiki_books_mnli | BertSentenceEmbeddings | en |
| en.embed_sentence.bert.wiki_books | sent_bert_wiki_books | BertSentenceEmbeddings | en |
| en.embed_sentence.bert.pubmed_squad2 | sent_bert_pubmed_squad2 | BertSentenceEmbeddings | en |
| en.embed_sentence.bert.pubmed | sent_bert_pubmed | BertSentenceEmbeddings | en |
| xx.embed_sentence.bert.muril | sent_bert_muril | BertSentenceEmbeddings | xx |
| yi.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | yi |
| uk.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | uk |
| te.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | te |
| ta.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | ta |
| so.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | so |
| sd.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | sd |
| ru.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | ru |
| pa.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | pa |
| ne.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | ne |
| mr.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | mr |
| ml.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | ml |
| kn.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | kn |
| bs.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | bs |
| id.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | id |
| gu.detect_sentence | sentence_detector_dl | SentenceDetectorDLModel | gu |
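The first component of each nlu.load() reference in the table above matches its language column, with xx marking multilingual models. The sketch below uses that convention to group a few references from the table by language. The helpers `language_of`, `group_by_language`, and `load_and_predict` are hypothetical names for this example, and the actual model call is imported lazily because it needs nlu and pyspark installed.

```python
from collections import defaultdict

def language_of(reference):
    """Return the leading language component of an nlu.load() reference."""
    return reference.split(".", 1)[0]

def group_by_language(references):
    """Map each language prefix to the references that start with it."""
    groups = defaultdict(list)
    for ref in references:
        groups[language_of(ref)].append(ref)
    return dict(groups)

def load_and_predict(reference, text):
    """Hypothetical wrapper around the load/predict calls used in this post."""
    import nlu  # lazy import: requires nlu and pyspark
    return nlu.load(reference).predict(text)

# A few references copied from the table above.
references = [
    "en.embed.longformer",
    "en.ner.conll_bert_base_cased",
    "fa.embed.token_bert.parsbert_ner",
    "xx.ner.xtreme_glove_840B_300",
    "gu.detect_sentence",
]
groups = group_by_language(references)
print(groups)
```

With nlu installed, `load_and_predict("en.ner.conll_bert_base_cased", "Peter loves to visit New York")` would download the model on first use and return the predictions.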

New Healthcare Transformer Models

12 new models from the amazing Spark NLP for Healthcare 3.2.0+ releases, including models for genetic variants, radiology, assertion, RxNorm, and adverse drug relations, plus new clinical TokenBert models that improve accuracy by 4% compared to the previous models.

| nlu.load() Reference | Spark NLP Reference | Annotator Class |
|---|---|---|
| en.med_ner.radiology.wip_greedy_biobert | jsl_rd_ner_wip_greedy_biobert | MedicalNerModel |
| en.med_ner.genetic_variants | ner_genetic_variants | MedicalNerModel |
| en.med_ner.jsl_slim | ner_jsl_slim | MedicalNerModel |
| en.med_ner.jsl_greedy_biobert | ner_jsl_greedy_biobert | MedicalNerModel |
| en.embed.token_bert.ner_clinical | bert_token_classifier_ner_clinical | MedicalNerModel |
| en.embed.token_bert.ner_jsl | bert_token_classifier_ner_jsl | MedicalNerModel |
| en.relation.ade | redl_ade_biobert | RelationExtractionDLModel |
| en.relation.ade_clinical | re_ade_clinical | RelationExtractionDLModel |
| en.relation.ade_biobert | re_ade_biobert | RelationExtractionDLModel |
| en.resolve.rxnorm_disposition | sbiobertresolve_rxnorm_disposition | SentenceEntityResolverModel |
| en.assert.jsl | assertion_jsl | AssertionDLModel |
| en.assert.jsl_large | assertion_jsl_large | AssertionDLModel |

PyArrow Memory Optimizations

NLU's integration with PyArrow has been optimized to share memory between the Python virtual machine and the Java virtual machine, which yields around 7% less memory consumption on average across all computations. This improvement takes effect for everyone using the default PySpark installation, which comes with a compatible PyArrow version.

If you manually install or upgrade PyArrow, please refer to the official Spark docs and make sure the installed PyArrow version works with your PySpark version.
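One way to check what is installed before comparing it against the Spark docs is sketched below, using only the standard library. The helper names are hypothetical, and the minimum version passed to `report` is a placeholder: look up the real requirement for your PySpark release rather than relying on the value used here.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def version_tuple(version, width=3):
    """Turn '4.0.1' into (4, 0, 1); stops at the first non-numeric component."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    # Pad with zeros so that '1.0' and '1.0.0' compare as equal.
    return tuple(parts + [0] * (width - len(parts)))

def meets_minimum(version, minimum):
    return version_tuple(version) >= version_tuple(minimum)

def report(minimum_pyarrow):
    """Summarize the installed PySpark and PyArrow versions as one line of text."""
    pyspark, pyarrow = installed_version("pyspark"), installed_version("pyarrow")
    if pyarrow is None:
        return f"pyspark={pyspark}, pyarrow is not installed"
    status = "ok" if meets_minimum(pyarrow, minimum_pyarrow) else "too old"
    return f"pyspark={pyspark}, pyarrow={pyarrow} ({status} for minimum {minimum_pyarrow})"

# "1.0.0" is a placeholder minimum, not a value taken from this post.
print(report("1.0.0"))
```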

Memory Benchmark

New Notebooks

Additional NLU Resources