We are very excited to release NLU 1.1.1!
This release features 3 new tutorial notebooks for open/closed-book question answering with Google’s T5, intent classification, and aspect-based NER.
In addition, NLU 1.1.0 comes with 25+ pre-trained models and pipelines in Amharic, Bengali, Bhojpuri, Japanese, and Korean from the amazing Spark NLP 2.7.2 release. Finally, NLU now supports running on Spark 2.3 clusters.
NLU 1.1.0 New Non-English Models
Language | nlu.load() reference | Spark NLP Model reference | Type |
---|---|---|---|
Arabic | ar.ner | arabic_w2v_cc_300d | Named Entity Recognizer |
Arabic | ar.embed.aner | aner_cc_300d | Word Embedding |
Arabic | ar.embed.aner.300d | aner_cc_300d | Word Embedding (Alias) |
Bengali | bn.stopwords | stopwords_bn | Stopwords Cleaner |
Bengali | bn.pos | pos_msri | Part of Speech |
Thai | th.segment_words | wordseg_best | Word Segmenter |
Thai | th.pos | pos_lst20 | Part of Speech |
Thai | th.sentiment | sentiment_jager_use | Sentiment Classifier |
Thai | th.classify.sentiment | sentiment_jager_use | Sentiment Classifier (Alias) |
Chinese | zh.pos.ud_gsd_trad | pos_ud_gsd_trad | Part of Speech |
Chinese | zh.segment_words.gsd | wordseg_gsd_ud_trad | Word Segmenter |
Bihari | bh.pos | pos_ud_bhtb | Part of Speech |
Amharic | am.pos | pos_ud_att | Part of Speech |
NLU 1.1.1 New English Models and Pipelines
New Easy NLU 1-liner Examples:
Extract aspects and entities from airline questions (ATIS dataset)
nlu.load("en.ner.atis").predict("i want to fly from baltimore to dallas round trip")
output: ["baltimore", "dallas", "round trip"]
Intent Classification for Airline Traffic Information System queries (ATIS dataset)
nlu.load("en.classify.questions.atis").predict("what is the price of flight from newyork to washington")
output: "atis_airfare"
Recognize Entities OntoNotes – ELECTRA Large
nlu.load("en.ner.onto.large").predict("Johnson first entered politics when elected in 2001 as a member of Parliament. He then served eight years as the mayor of London.")
output: ["Johnson", "first", "2001", "eight years", "London"]
Question classification of open-domain and fact-based questions Pipeline – TREC50
nlu.load("en.classify.trec50.pipe").predict("When did the construction of stone circles begin in the UK? ")
output: LOC_other
Traditional Chinese Word Segmentation
# 'However, this treatment also creates some problems' in Chinese
nlu.load("zh.segment_words.gsd").predict("然而,這樣的處理也衍生了一些問題。")
output: ["然而",",","這樣","的","處理","也","衍生","了","一些","問題","。"]
Part of Speech for Traditional Chinese
# 'However, this treatment also creates some problems' in Chinese
nlu.load("zh.pos.ud_gsd_trad").predict("然而,這樣的處理也衍生了一些問題。")
Output:
Token | POS |
---|---|
然而 | ADV |
, | PUNCT |
這樣 | PRON |
的 | PART |
處理 | NOUN |
也 | ADV |
衍生 | VERB |
了 | PART |
一些 | ADJ |
問題 | NOUN |
。 | PUNCT |
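As a small illustration of working with this kind of output, the sketch below copies the Token/POS pairs from the table above into a plain Python list and picks out tokens by tag. The helper name `tokens_with_tag` is hypothetical and not part of NLU; the data is hard-coded so the snippet runs without NLU or Spark installed.

```python
# Token/POS pairs copied by hand from the output table above.
tagged = [
    ("然而", "ADV"), (",", "PUNCT"), ("這樣", "PRON"), ("的", "PART"),
    ("處理", "NOUN"), ("也", "ADV"), ("衍生", "VERB"), ("了", "PART"),
    ("一些", "ADJ"), ("問題", "NOUN"), ("。", "PUNCT"),
]


def tokens_with_tag(pairs, tag):
    """Return the tokens whose POS tag equals `tag`, in sentence order.

    Hypothetical helper for illustration only; it is not an NLU function.
    """
    return [token for token, pos in pairs if pos == tag]


nouns = tokens_with_tag(tagged, "NOUN")
verbs = tokens_with_tag(tagged, "VERB")
print(nouns)  # ['處理', '問題']
print(verbs)  # ['衍生']
```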
Thai Word Segment Recognition
# 'Mona Lisa is a 16th-century oil painting created by Leonardo held at the Louvre in Paris' in Thai
nlu.load("th.segment_words").predict("Mona Lisa เป็นภาพวาดสีน้ำมันในศตวรรษที่ 16 ที่สร้างโดย Leonardo จัดขึ้นที่พิพิธภัณฑ์ลูฟร์ในปารีส")
Output:
token |
---|
M |
o |
n |
a |
Lisa |
เป็น |
ภาพ |
ว |
า |
ด |
สีน้ำ |
มัน |
ใน |
ศตวรรษ |
ที่ |
16 |
ที่ |
สร้าง |
โ |
ด |
ย |
L |
e |
o |
n |
a |
r |
d |
o |
จัด |
ขึ้น |
ที่ |
พิพิธภัณฑ์ |
ลูฟร์ |
ใน |
ปารีส |
Part of Speech for Bengali (POS)
# 'The village is also called 'Mod' in Tora language' in Bengali
nlu.load("bn.pos").predict("বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷")
Output:
token | pos |
---|---|
বাসস্থান-ঘরগৃহস্থালি | NN |
তোড়া | NNP |
ভাষায় | NN |
গ্রামকেও | NN |
বলে | VM |
` | SYM |
মোদ | NN |
‘ | SYM |
৷ | SYM |
Stop Words Cleaner for Bengali
# 'This language is not enough' in Bengali
df = nlu.load("bn.stopwords").predict("এই ভাষা যথেষ্ট নয়")
Output:
cleanTokens | token |
---|---|
ভাষা | এই |
যথেষ্ট | ভাষা |
নয় | যথেষ্ট |
None | নয় |
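The `None` in the last row is worth a note. One plausible reading, which is an assumption about the table layout and not a description of NLU internals, is that the two columns are lists of different lengths placed side by side: four tokens, but only three cleaned tokens once the stop word is removed. The sketch below reproduces that shape with the standard library alone.

```python
from itertools import zip_longest

# Values copied from the table above: four input tokens, three cleaned tokens.
tokens = ["এই", "ভাষা", "যথেষ্ট", "নয়"]
clean_tokens = ["ভাষা", "যথেষ্ট", "নয়"]

# Pair the two lists row by row; the shorter list is padded with None,
# which matches the last row of the table (assumed layout, see note above).
rows = list(zip_longest(clean_tokens, tokens))

for clean, token in rows:
    print(clean, "|", token)

# The stop word is simply the token that has no counterpart in clean_tokens.
removed = [t for t in tokens if t not in clean_tokens]
```

Read this way, `cleanTokens` and `token` in the same row are not related to each other; each column should be read top to bottom on its own.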
Part of Speech for Bhojpuri
# 'The people of Ohu know that the foundation of Bhojpuri was shaken' in Bhojpuri
nlu.load('bh.pos').predict("ओहु लोग के मालूम बा कि श्लील होखते भोजपुरी के नींव हिल जाई")
Output:
pos | token |
---|---|
DET | ओहु |
NOUN | लोग |
ADP | के |
NOUN | मालूम |
VERB | बा |
SCONJ | कि |
ADJ | श्लील |
VERB | होखते |
PROPN | भोजपुरी |
ADP | के |
NOUN | नींव |
VERB | हिल |
AUX | जाई |
Amharic Part of Speech (POS)
# ' "Son, finish the job," he said.' in Amharic
nlu.load('am.pos').predict('ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ"')
Output:
pos | token |
---|---|
NOUN | ልጅ |
DET | ኡ |
PART | ን |
NOUN | ሥራ |
DET | ው |
PART | ን |
VERB | አስጨርስ |
PRON | ኧው |
AUX | ኣል |
PRON | ኧሁ |
PUNCT | ። |
NOUN | “ |
Thai Sentiment Classification
# 'I love peanut butter and jelly!' in Thai
nlu.load('th.classify.sentiment').predict('ฉันชอบเนยถั่วและเยลลี่!')[['sentiment','sentiment_confidence']]
Output:
sentiment | sentiment_confidence |
---|---|
positive | 0.999998 |
Arabic Named Entity Recognition (NER)
# 'In 1918, the forces of the Arab Revolt liberated Damascus with the help of the British' in Arabic
nlu.load('ar.ner').predict('في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز', output_level='chunk')[['entities_confidence','ner_confidence','entities']]
Output:
entity_class | ner_confidence | entities |
---|---|---|
ORG | [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669] | قوات الثورة العربية |
LOC | [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669] | دمشق |
PER | [1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063, 0.9987999796867371, 0.9990000128746033, 0.9998999834060669, 0.9998999834060669, 0.9993000030517578, 0.9998999834060669] | الإنكليز |
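Every row above carries the same list of 11 confidence values, and the input sentence has 11 whitespace-separated words, so the list looks like one score per token for the whole sentence. Under that assumption (and assuming whitespace splitting matches the model's tokenization, which is not guaranteed), a per-entity score can be recovered by locating each chunk in the sentence. The helper `chunk_confidences` is hypothetical and the data is copied from the table, so the sketch runs without NLU.

```python
# Sentence, per-token confidences and entity chunks copied from the example above.
sentence = "في عام 1918 حررت قوات الثورة العربية دمشق بمساعدة من الإنكليز"
ner_confidence = [
    1.0, 1.0, 1.0, 0.9997000098228455, 0.9840999841690063,
    0.9987999796867371, 0.9990000128746033, 0.9998999834060669,
    0.9998999834060669, 0.9993000030517578, 0.9998999834060669,
]
entities = ["قوات الثورة العربية", "دمشق", "الإنكليز"]

# Assumption: splitting on whitespace yields the same tokens the model saw.
tokens = sentence.split()


def chunk_confidences(chunk, tokens, confidences):
    """Return the confidences of the first occurrence of `chunk` in `tokens`.

    Hypothetical helper, not an NLU function. Returns [] if the chunk
    is not found as a contiguous run of tokens.
    """
    words = chunk.split()
    for start in range(len(tokens) - len(words) + 1):
        if tokens[start:start + len(words)] == words:
            return confidences[start:start + len(words)]
    return []


# Use the weakest token score as a conservative confidence for each chunk.
per_chunk = {
    entity: min(chunk_confidences(entity, tokens, ner_confidence), default=None)
    for entity in entities
}

for entity, score in per_chunk.items():
    print(entity, score)
```

Taking the minimum is only one design choice; a mean over the chunk's tokens would be equally defensible.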
NLU 1.1.0 Enhancements
- Spark 2.3 compatibility
New NLU Notebooks and Tutorials
Intent Classification for Airline messages ATIS
Installation
# PyPI
!pip install nlu pyspark==2.4.7
# Conda
# Install NLU from Anaconda/Conda
conda install -c johnsnowlabs nlu
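After installing, a quick check that both packages are importable can save a confusing first run. The sketch below is one way to do that; `has_module` and `smoke_test` are hypothetical helper names, and the only NLU call used is the `nlu.load(...).predict(...)` one-liner already shown in these notes. NLU is imported lazily so the file still loads on a machine where it is missing.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module `name` can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def smoke_test(text="what is the price of flight from newyork to washington"):
    """Run one example from these notes, or return None if NLU/PySpark is missing.

    Note: the first call downloads the model, so it needs network access.
    """
    if not (has_module("nlu") and has_module("pyspark")):
        return None
    import nlu  # lazy import: only reached when the package is installed
    return nlu.load("en.classify.questions.atis").predict(text)


if __name__ == "__main__":
    print("nlu installed:", has_module("nlu"))
    print("pyspark installed:", has_module("pyspark"))
```

Calling `smoke_test()` afterwards exercises the full pipeline, including the model download.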