Meet our team at HIMSS 2024 in Orlando, FL, March 11 - 15. Schedule a meeting.
was successfully added to your cart.

Tokenizing Asian texts into words with word segmentation models in Spark NLP

Use pretrained models, segment texts into words, and train custom word segmenter models with Python.

TL; DR: Some Asian languages don’t separate words by white space like English, and NLP practitioners need to tokenize the texts using word segmentation (WS) models. Spark NLP has pretrained models for WS and the capability to train new ones based on labeled data.

Spark NLP as a Chinese Word Segmenter

Word Segmentation is an NLP task for languages that don’t have white space between their words, for example, Chinese, Japanese, Thai, and Korean. Take this Chinese sentence as an example:


The words 今天 (today), 我 (I), 很 (very), and 高兴 (happy) don’t have any separation between them, and processing this sentence in a programmatically way can be hard. For these kinds of languages to perform NLP tasks, we need to be able to tokenize the sentences into their parts.

Common ways to do that include splitting the sentence character (or ideogram) by character, without taking into consideration that words can be formed with more than one of them. Another way is to use Word Segmentation (WS) models, which can be based on dictionaries or machine learning models to identify which subset of ideograms form the words to split them.

In this post, we are going to explore how to perform word segmentation using Spark NLP.

Introduction to Spark NLP

Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.

Since its first release in July 2017, Spark NLP has grown into a full NLP tool, providing:

  • A single unified solution for all your NLP needs
  • Transfer learning and implementing the latest and greatest SOTAalgorithms and models in NLP research
  • The most widely used NLP library in the industry (5 years in a row)
  • The most scalable, accurate, and fastest library in NLP history

Spark NLP comes with 12000+ pretrained pipelines and models in more than 250+ languages. It supports most of the NLP tasks and provides modules that can be used seamlessly in a cluster.

Spark NLP modules

Spark NLP modules

Spark NLP processes the data using Pipelines, a structure that contains all the steps to be run on the input data:

Spark NLP pipelines

Spark NLP pipelines

Each step contains an annotator that performs a specific task, such as tokenization, normalization, and dependency parsing. Each annotator has input(s) annotation(s) and outputs new annotation.

How to Install Spark NLP

To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:

pip install spark-nlp

For other installation options for different environments and machines, please check the official documentation.

Then, simply import the library and start a Spark session:

import sparknlp

spark = sparknlp.start()

Features of Chinese word segmentation with Spark NLP

Spark NLP currently has pretrained models for Chinese, Japanese, Korean, and Thai. Let’s use an example in Chinese:


In this example, the words are 我们 (we, composition of two ideograms), 都 (all, only one ideogram), 很 (very, only one ideogram), 喜欢 (like, composition of two ideograms), 自然语言处理 (NLP, composition of six ideograms) which can be split further down to 自然 (Natural), 语言 (Language), and 处理 (Processing).

The annotator to perform WS using pretrained models in Spark NLP is the WordSegmenterModel. This annotator uses DOCUMENT annotations as input and outputs TOKEN annotations. For a list of available models, check NLP Models Hub. The pipeline consists of only two stages, the document assembler that creates the DOCUMENT annotation from raw texts and the WordSegmenterModel stage to split the text into words.

To use pretrained models in Spark NLP, simply use the .pretrained() method of the desired annotator, and pass as parameters the name of the model to use and the language the model was trained in. In our example, we will use the model called wordseg_ctb9 which was trained on the Chinese Treebank version 9 dataset. You can explore other available models in the NLP Models Hub.

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import WordSegmenterModel
from import Pipeline
import pyspark.sql.functions as F

document_assembler = (

# Model trained on the Chinese Treebank 9 dataset
word_segmenter = (
    WordSegmenterModel.pretrained("wordseg_ctb9", "zh")

pipeline = Pipeline(stages=[document_assembler, word_segmenter])

We then can transform the pipeline into a PipelineModel to make predictions. Since we only have pretrained stages, we can just fit the pipeline in empty data:

# Create an empty data frame with column `text`
empty_df = spark.createDataFrame([[""]]).toDF("text")

model =

Now we can apply the model to our example sentence, which needs to be in a spark data frame.

# Chinese example
example_sentence = r"我们都很喜欢自然语言处理!"
example_df = spark.createDataFrame([[example_sentence]]).toDF("text")

# Make predictions
result = model.transform(example_df)
|都  |
|很  |
|!  |

We can see that the model split Natural Language Processing into its parts and didn’t consider it as one word. This can happen to unfamiliar words or technical words which were not present or very scarce in the training data, so depending on your needs, choosing which model to use can have a significant impact.

Splitting words with one line of code

In October 2022, John Snow Labs released the open-source johnsnowlabs a library that contains all the company products, open-source and licensed, under one common library. This simplified the workflow, especially for users that work with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). This new library is a wrapper on all of John Snow Labs’ libraries, and can be installed with pip:

pip install johnsnowlabs

Please check the official documentation for more examples and usage of this library. To perform Chinese word segmentation, we can simply run:

Result from the one-liner word segmenter model

Result from the one-liner word segmenter model

NOTE: When using only the johnsnowlabs library, make sure you initialize the spark session with the configuration you have available. Since some libraries are licensed, you may need to set the path to your license file. If you are only using the open-source library, you can start the session with spark = nlp.start(nlp=False). The default parameters for the start function include using the licensed Healthcare NLP library with nlp=True, but we can set that to Falseand use all the resources of the open-source libraries such as Spark NLP, Spark NLP Display, and NLU.

The pretrained models are useful for most applications, but in some cases, they don’t fit the practitioner’s needs. If this happens, you can also train a new one!

Training a new WordSegmenterModel

To train a new model, we need to use the WordSegmenterApproach annotator. The implemented in Spark NLP model is a modification of the following reference paper:

Chinese Word Segmentation as Character Tagging (Xue, IJCLCLP 2003)

It is a machine-learning model that uses labeled data to learn in a supervised manner. The training data is a text file in the same format used to train Part-of-Speech (POS) models, meaning that each ideogram/character is tagged with a label and the characters are separated by a delimiter.

We will use the following as training data to train a simple Korean Word Segmenter model (character and tag are separated by | and characters-tag are separated by white space):


Where the labels are:

  • LL: The beginning of the word
  • MM: Middle part of the word
  • RR: The end of the word
  • LR: A word formed of only one character

We will save this example in a text file:

with open("train_data.txt", "w", encoding="utf8") as f:
  f.write("우|LL 리|RR 모|LL 두|MM 는|RR 자|LL 연|MM 어|RR 처|LL 리|MM 를|RR 좋|LL 아|MM 합|MM 니|MM 다|RR !|LR ")

To read this kind of dataset, you can use the helper class POS available in the module.

from import POS

train_data = POS().readDataset(spark, "train_data.txt")
|                         text|            document|                tags|
|우 리 모 두 는 자 연 어 처...|[{document, 0, 32...|[{pos, 0, 0, LL, ...|

The POS class creates a spark data frame with the text, document, and tags columns. They contain, respectively, the raw text, the DOCUMENT annotation, and the POS annotation of the corpus. The WordSegmenterApproach needs information on these columns is to train the model, so we need to set them as the input and POS columns:

wordSegmenter = (
    .setFrequencyThreshold(1) # Since our data is very small

pipeline = Pipeline().setStages([documentAssembler, wordSegmenter])

Then, to train the model we fit it into the training data:

pipelineModel =

Done, we have our brand-new Word Segmenter model. Let’s see how to use it in practice. We will use the LightPipeline to make predictions directly on strings instead of using spark data frames. You can learn more about this class in this link and in this previous post.

lp = LightPipeline(pipelineModel)

  "document": ["우리모두는자연어처리를좋아합니다!"],
  "token": ["우리", "모두는", "자연어", "처리를", "좋아합니다", "!"]

The results from Light Pipelines are a dictionary if only one string was passed or a list of dictionaries if a list of strings is used.

That’s it! Now you know how to train a new Word Segmenter model, as well as how to use a pretrained one!


In this post, you learned about word segmentation and how to perform this NLP task using pretrained models in Spark NLP. You also learned how to train a new model using labeled data and are ready to segment words in Chinese, Japanese, Korean, or Thai.


Mastering Dependency Parsing with Spark NLP and Python

Learn how to use Spark NLP and Python to analyze part of speech and grammar relations between words at scale Dependency parsing...