Spark NLP – the best thing since NumPy
Why is language understanding hard?
Built on the Spark ML APIs
Apache 2.0 Licensed
Active development & support
Our Choice of Architecture
Data was taken from the American National Corpus (http://www.anc.org), using the newspaper section of the MASC 3.0.2 written corpus.
The pipeline covers sentence boundary detection, tokenization, and part-of-speech tagging.
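To make the three stages concrete, here is a minimal, library-free sketch of how they chain together. It is not Spark NLP code: the function names (split_sentences, tokenize, tag, run_pipeline) and the tiny rule-based tagger are hypothetical stand-ins, assumed only for illustration. A real pipeline would use trained models for each step.

```python
import re

# Toy stand-ins for the three benchmark stages; none of this is Spark NLP code.

def split_sentences(text):
    """Sentence boundary detection: split after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Tokenization: separate runs of word characters from punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag(tokens):
    """Part-of-speech tagging with a few hand-written rules (hypothetical tag set)."""
    tagged = []
    for tok in tokens:
        if not tok[0].isalnum():
            label = "PUNCT"
        elif tok.isdigit():
            label = "NUM"
        elif tok.lower() in {"the", "a", "an"}:
            label = "DET"
        else:
            label = "X"  # unknown: a trained model would decide here
        tagged.append((tok, label))
    return tagged

def run_pipeline(text):
    """Run the stages in order: sentences, then tokens, then tags."""
    return [tag(tokenize(sentence)) for sentence in split_sentences(text)]

if __name__ == "__main__":
    for sentence in run_pipeline("The cat sat. It slept!"):
        print(sentence)
```

Each stage consumes the output of the previous one, which is the property that lets a pipeline of this shape be expressed as a sequence of stages in a framework such as Spark ML.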
Spark-NLP was 38 times faster to train on 100 KB of data.
Spark-NLP was 80 times faster to train on 2.6 MB of data.
- 2.5x speedup with a 4-node cluster
- Zero code changes
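As a quick reading aid for the cluster figure, the snippet below works out the parallel efficiency implied by a 2.5x speedup on 4 nodes. The helper name parallel_efficiency is an assumption made for this sketch, and the calculation uses the standard definition (speedup divided by node count), not a metric reported in the benchmark itself.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup achieved (1.0 means perfect scaling)."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return speedup / nodes

if __name__ == "__main__":
    # Figures from the slide: 2.5x speedup on a 4-node cluster.
    efficiency = parallel_efficiency(2.5, 4)
    print(f"{efficiency:.1%}")  # 62.5%
```

Under this definition the 4-node run reaches a bit under two thirds of ideal linear scaling.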
Spark-NLP scales as Spark does: 1 to 3 orders of magnitude faster depending on cluster setup
Not compared to spaCy, since spaCy cannot leverage a cluster