The growing availability of open-source natural language processing (NLP) toolkits has made it easier for practitioners to build tools with sophisticated linguistic processing, and for researchers to make scientific discoveries in natural language understanding.
In this talk, I will introduce Stanza, our latest Python NLP toolkit, which supports 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic, fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition.
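To make the pipeline components listed above concrete, here is a minimal sketch of how they might be requested through Stanza's `Pipeline` class. The helper names `processor_string` and `analyze` are hypothetical and introduced only for illustration. The mapping from the task list to processor names is my reading of it, and the sketch assumes the chosen language ships models for every processor it requests.

```python
# Stanza processor names that correspond, as we read it, to the tasks listed above:
# tokenization, multi-word token expansion, POS/morphological tagging,
# lemmatization, and dependency parsing. NER is added optionally.
CORE_PROCESSORS = ["tokenize", "mwt", "pos", "lemma", "depparse"]


def processor_string(with_ner=True):
    """Build the comma-separated processor list accepted by stanza.Pipeline."""
    names = CORE_PROCESSORS + (["ner"] if with_ner else [])
    return ",".join(names)


def analyze(text, lang="fr", with_ner=True):
    """Run the pipeline on `text` and return per-word annotations and entities.

    Hypothetical helper: assumes `stanza` is installed and that models for
    `lang` can be downloaded. Not every language provides every processor.
    """
    import stanza  # imported lazily so the helper above works without it

    stanza.download(lang)  # fetches the default models for this language
    nlp = stanza.Pipeline(lang=lang, processors=processor_string(with_ner))
    doc = nlp(text)

    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            # head is the 1-based index of the governing word (0 marks the root)
            rows.append((word.text, word.lemma, word.upos,
                         word.feats, word.head, word.deprel))

    entities = [(ent.text, ent.type) for ent in doc.ents] if with_ner else []
    return rows, entities
```

Calling `analyze("Nous avons atteint la fin du sentier.")` would download the default French models on first use and return one tuple per syntactic word. French is used here only because it contains multi-word tokens such as "du".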
I will describe Stanza’s neural architecture, its simple user interface, and its improved performance over existing toolkits on 112 datasets covering 66 languages. Next, I will discuss our recent efforts to extend Stanza’s language processing capabilities to the biomedical and clinical domains. The latest release of Stanza offers native support for accurate syntactic analysis and named entity recognition on biomedical literature text and clinical notes. I will explain how these extensions were built and report the performance of the resulting models on standard biomedical NLP benchmarks.
Lastly, I will present Stanza’s Python interface to the widely used Stanford CoreNLP library, which extends Stanza’s functionality to an even richer range of tasks. I will close by outlining our future plans for the Stanza library.