Text Cleaning: Standard Text Normalization with Spark NLP

07.06.2023

Gursev Pirge

Researcher and Data Scientist

The Normalizer annotator in Spark NLP performs text normalization on data. It is used to remove all dirty characters from text following a regex pattern, convert text to lowercase, remove special characters, punctuation, and transform words based on a provided dictionary.

The Normalizer annotator in Spark NLP is often used as part of a preprocessing step in NLP pipelines to improve the accuracy and quality of downstream analyses and models. It is a configurable component that can be customized based on specific use cases and requirements.

Normalization refers to the process of converting text to a standard format, which can help improve the accuracy and efficiency of subsequent natural language processing (NLP) tasks. Normalization typically involves removing or replacing certain characters or words, converting text to lowercase or uppercase, removing punctuation etc.

The Normalizer Annotator works by taking a token, and then applying a series of text transformations to it. These transformations can be configured by the user to meet their specific needs. Once the transformations have been applied, the output is returned as a normalized token.

P.S.: in Spark NLP, a token is defined as a basic unit of text that has been segmented from a larger body of text. Tokens are typically words, but they can also be punctuation, numbers, or other meaningful units of text. Tokenization is the process of breaking down text into tokens.

In this post, we will discuss how to use Spark NLP’s Normalizer annotator to perform normalization efficiently.

Let us start with a short Spark NLP introduction and then discuss the details of the normalizing techniques with some solid results.

Introduction to Spark NLP

Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.

Since its first release in July 2017, Spark NLP has grown in a full NLP tool, providing:

A single unified solution for all your NLP needs
Transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
The most widely used NLP library in industry (5 years in a row)
The most scalable, accurate and fastest library in NLP history

Spark NLP comes with 17,800+ pretrained pipelines and models in more than 250+ languages. It supports most of the NLP tasks and provides modules that can be used seamlessly in a cluster.

Spark NLP processes the data using Pipelines, structure that contains all the steps to be run on the input data:

Each step contains an annotator that performs a specific task such as tokenization, normalization, and dependency parsing. Each annotator has input(s) annotation(s) and outputs new annotation.

An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.

Setup

To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:

pip install spark-nlp
pip install pyspark

For other installation options for different environments and machines, please check the official documentation.

Then, simply import the library and start a Spark session:

import sparknlp

# Start Spark Session
spark = sparknlp.start()

Normalizer

In NLP, a Normalizer is a component or module that is used to transform input text into a standard or normalized form. This is typically done to make text more consistent and easier to process.

Normalizer annotator in Spark NLP expects TOKEN as input, and then will provide TOKEN as output. This means that, input data will be in the form of individual tokens, and the output (the predictions) will be in the form of individual tokens (normalized) as well.

Thus, we need the previous steps to generate the tokens that will be used as an input to our annotator.

# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline, LightPipeline
from sparknlp.annotator import (
    Tokenizer,
    Normalizer
)
import pyspark.sql.functions as F

# Step 1: Transforms raw texts to `document` annotation
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

# Step 2: Gets the tokens of the text
tokenizer = (
    Tokenizer()
    .setInputCols(["document"])
    .setOutputCol("token")
)

# Step 3:  Normalizes the tokens
normalizer = (
    Normalizer()
    .setInputCols(["token"])
    .setOutputCol("normalized")
)

pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer])

We use a sample text to see the performance of the Normalizer annotator, then fit and transform the dataframe to get the results:

# Create a dataframe from the sample_text
text = "John and Peter are brothers. 
       However they don't support each other that much. 
       John is 20 years old and Peter is 26"

data = spark.createDataFrame([[text]]) \
    .toDF("text")

# Fit and transform to get a prediction
model = pipeline.fit(data)
result = model.transform(data)

# Display the results
result.selectExpr("normalized.result").show(truncate = False)

Normalized tokens

The default settings of the Normalizer annotator will only keep alphabet letters ([^A-Za-z]), so the numbers in the sample text are removed.

Using LightPipeline

LightPipeline is a Spark NLP specific Pipeline class equivalent to Spark ML Pipeline. The difference is that its execution does not hold to Spark principles, instead it computes everything locally (but in parallel) in order to achieve fast results when dealing with small amounts of data. This means, we do not input a Spark Dataframe, but a string or an Array of strings instead, to be annotated.

We can show the results in a Pandas DataFrame by running the following code:

import pandas as pd

light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)

results_tabular = []
for res in light_result[0]["normalized"]:
    results_tabular.append(
        (
            res.begin,
            res.end,# res.result.token,
            res.result,
                   )
    )
pd.DataFrame(results_tabular, columns=["begin", "end", "chunk"])

The dataframe shows the tokens after normalizing the text

Additional Parameters

We can include additional parameters to the Normalizer in order to improve the results. For example, CleanupPatterns is a feature that allows you to specify a set of regular expressions to clean and preprocess your text data. By default, the annotator keeps only the alphabet letters, but you can create your own set of CleanupPatterns to include specific characters or patterns that you want.

In the code snippet below, (["""[^\w\d\s]"""]) removes all non-word, non-digit and non-space characters.

Also, the code will make sure that the normalized tokens will be lower case, and only the words with 3 or 4 characters will be selected.

normalizer = (
    Normalizer()
    .setInputCols(["token"])
    .setOutputCol("normalized")
    .setCleanupPatterns(["""[^\w\d\s]"""]) 
    .setLowercase(True)
    .setMaxLength(4)
    .setMinLength(3) #Default = 0
)

pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer])

# Run the same dataframe as above and display the results
model = pipeline.fit(data)
result = model.transform(data)

result.selectExpr("normalized.result").show(truncate = False)

Normalized tokens after using additional parameters

And the Light Pipeline results will be produced by:

light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)

results_tabular = []
for res in light_result[0]["normalized"]:
    results_tabular.append(
        (
            res.begin,
            res.end,
            res.result,
         ))

pd.DataFrame(results_tabular, columns=["begin", "end", "chunk"])

The dataframe shows the tokens after using additional parameters

Adding a Slang Dictionary

Another useful tool, especially when working with informal language or text from social media, is the option to use the setSlangDictionary parameter. It allows you to provide a list of words that are commonly used in a specific context or domain, but may not be recognized by standard language models. By providing this list, you can ensure that the model is able to accurately process and understand the text.

Here, we first create a csv file using the slangs list defined below, save the file:

slang_text = """Hey, what's good fam? I'm just chillin with my squad. 
                We're about to get to this dope party."""

#Create a Demo Slang CSV File
import csv
  
field_names = ['Slang', 'Correct_Word']
  
slangs = [
{'Slang': "fam", 'Correct_Word': 'friends'},
{'Slang': "chillin", 'Correct_Word': 'relaxing'},
{'Slang': "squad", 'Correct_Word': 'group of friends'},
{'Slang': "dope", 'Correct_Word': 'excellent'}
]
  
with open('slangs.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames = field_names)
    writer.writeheader()
    writer.writerows(slangs)

Now, use the setSlangDictionary parameter to define the file including the slang words:

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setSlangDictionary("/content/slangs.csv" ,",")

    
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([[slang_text]]) \
    .toDF("text")

model = pipeline.fit(data)
result = model.transform(data)
result.selectExpr("normalized.result").show(truncate = False)

And the Light Pipeline results will be produced by:

light_model = LightPipeline(model)

light_result = light_model.fullAnnotate(text)

results_tabular = []
for res in light_result[0]["normalized"]:
    results_tabular.append(
        (
            res.begin,
            res.end,
            res.result,
         ))

pd.DataFrame(results_tabular, columns=["begin", "end", "chunk"])

The dataframe shows the tokens after normalizing according to the slang dictionary

One-liner alternative

In October 2022, John Snow Labs released the open-source johnsnowlabs library that contains all the company products, open-source and licensed, under one common library. This simplified the workflow especially for users working with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). This new library is a wrapper on all John Snow Lab’s libraries, and can be installed with pip:

pip install johnsnowlabs

Please check the official documentation for more examples and usage of this library.

To normalize a text with one line of code, we can simply run the following code

from johnsnowlabs import nlp

text = '@CKL_IT says that #normalizers are pretty useful to clean 
#structured_strings in #NLU like tweets'

nlp.load('norm').predict(text)

After using the one-liner model, the result shows the original and normalized tokens

The one-liner is based on default models for each NLP task. Depending on your requirements, you may want to use the one-liner for simplicity or customizing the pipeline to choose specific models that fit your needs.

NOTE: when using only the johnsnowlabs library, make sure you initialize the spark session with the configuration you have available. Since some of the libraries are licensed, you may need to set the path to your license file. If you are only using the open-source library, you can start the session with spark = nlp.start(nlp=False). The default parameters for the start function includes using the licensed Healthcare NLP library with nlp=True, but we can set that to False and use all the resources of the open-source libraries such as Spark NLP, Spark NLP Display, and NLU.

For additional information, please check the following references.

Documentation : Normalizer
Python Docs : Normalizer
Scala Docs : Normalizer
For extended examples of usage, see the Spark NLP Workshop repository.

Conclusion

Normalizer in Spark NLP is a useful annotator that performs text normalization on data. It is an important preprocessing step in NLP pipelines that can help improve the accuracy and quality of downstream analyses and models. By standardizing the text data, the Normalizer annotator can reduce the noise and variability in the text, making it easier to analyze and extract meaningful information.

The Normalizer annotator in Spark NLP is a configurable component that can be customized based on specific use cases and requirements. It offers a variety of options for text normalization, such as converting text to lowercase, removing special characters, punctuation, and stop words, and even defining a slang dictionary.

Try Healthcare NLP

See in action

Gursev Pirge

Researcher and Data Scientist

Our additional expert:

A Researcher and Data Scientist with demonstrated success delivering innovative policies and machine learning algorithms, having strong statistical skills, and presenting to all levels of leadership to improve decision making. Experience in Education, Logistics, Data Analysis and Data Science. Strong education professional with a Doctor of Philosophy (Ph.D.) focused on Mechanical Engineering from Boğaziçi University.

Unlock the Power of BERT-based Models for Advanced Text Classification in Python

Gursev Pirge

Sequence classification with transformers refers to the task of predicting a label or category for a sequence of data, which can include...

Text Cleaning: Standard Text Normalization with Spark NLP

Introduction to Spark NLP

Setup

Normalizer

Using LightPipeline

Additional Parameters

Adding a Slang Dictionary

One-liner alternative

Conclusion

Unlock the Power of BERT-based Models for Advanced Text Classification in Python

Recommended For You