A Powerful Tool for Sentiment Analysis
Sentiment analysis, also known as opinion mining, is a computational approach that aims to identify and extract subjective information from text data. This technique has gained significant attention in recent years due to its applications in various fields, including marketing, customer service, political analysis, and social media monitoring.
One notable approach to sentiment analysis is the Vivek approach, a sentiment analyzer inspired by the algorithm of Vivek Narayanan and available in Spark NLP. It is based on an enhanced Naive Bayes classifier that learns from labelled text which words are associated with positive or negative sentiment. The approach has been applied to social media data, movie reviews, and product reviews. It is simple, effective, and easy to implement with open-source tools and libraries.
In this post, you will learn how to use the ViveknSentiment annotator of Spark NLP to perform sentiment analysis with a pretrained model, and how to easily train your own models for the same purpose.
Spark NLP offers multiple approaches for detecting the sentiment of a text, which is essentially a text classification problem. There are separate blog posts for the rule-based system and for a deep learning approach.
Let us start with a short Spark NLP introduction and then discuss the details of the Vivek sentiment analysis technique with some solid results.
Introduction to Spark NLP
Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.
Since its first release in July 2017, Spark NLP has grown into a full NLP tool, providing:
- A single unified solution for all your NLP needs
- Transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
- The most widely used NLP library in industry (5 years in a row)
- The most scalable, accurate and fastest library in NLP history
Spark NLP comes with 17,800+ pretrained pipelines and models in more than 250 languages. It supports most NLP tasks and provides modules that can be used seamlessly in a cluster.
Spark NLP processes data using Pipelines, a structure that contains all the steps to be run on the input data.
Each step contains an annotator that performs a specific task such as tokenization, normalization, or dependency parsing. Each annotator takes one or more input annotations and outputs a new annotation.
An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.
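To make the idea of annotators consuming and producing annotations more concrete, here is a minimal pure-Python sketch. It does not use Spark NLP: the `Annotation` dataclass and the `document_assembler` and `tokenizer` functions are hypothetical stand-ins that only mimic the general shape of the data (an annotation type, character offsets, a result string, and metadata).

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Simplified stand-in for an annotation: a typed, located piece of text."""
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: dict = field(default_factory=dict)


def document_assembler(text):
    """Wrap raw text into a single 'document' annotation."""
    return [Annotation("document", 0, len(text) - 1, text)]


def tokenizer(documents):
    """Consume 'document' annotations and emit one 'token' annotation per word."""
    tokens = []
    for doc in documents:
        # Words and standalone punctuation marks each become a token
        for match in re.finditer(r"\w+|[^\w\s]", doc.result):
            tokens.append(
                Annotation("token", match.start(), match.end() - 1, match.group())
            )
    return tokens


# Each stage takes the previous stage's output as its input.
documents = document_assembler("I recommend this movie!")
tokens = tokenizer(documents)
print([t.result for t in tokens])
```

A real pipeline chains many more stages in the same way: every stage reads the annotation columns it needs and adds a new one.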
To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:
pip install spark-nlp
pip install pyspark
For other installation options for different environments and machines, please check the official documentation.
Then, simply import the library and start a Spark session:
import sparknlp

# Start Spark Session
spark = sparknlp.start()
Enhanced Naive Bayes Model Approach — ViveknSentiment
ViveknSentiment uses a Naive Bayes classifier for sentiment analysis. In statistics, Naive Bayes classifiers are a family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong independence assumptions between the features.
Bayes’ theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). Naive Bayes classifiers assume that the effect of the value of a predictor x on a given class c is independent of the values of the other predictors. This assumption is called class conditional independence.
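As a rough illustration of the rule P(c|x) = P(x|c) P(c) / P(x), the following pure-Python sketch trains a plain multinomial Naive Bayes classifier with add-one (Laplace) smoothing on a few toy sentences. This is not the ViveknSentiment implementation, which adds the enhancements described in the paper. The function names and the toy data are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Count words per class and documents per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in samples:
        words = text.lower().split()
        word_counts[label].update(words)
        class_counts[label] += 1
        vocabulary.update(words)
    return word_counts, class_counts, vocabulary


def predict(text, word_counts, class_counts, vocabulary):
    """Return the class with the highest log P(c) + sum of log P(x_i|c)."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Prior P(c): share of training documents with this label
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Likelihood P(x_i|c) with add-one smoothing; the independence
            # assumption lets us add the per-word log probabilities
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from any real dataset
toy_samples = [
    ("great movie loved it", "positive"),
    ("wonderful acting great story", "positive"),
    ("terrible movie hated it", "negative"),
    ("boring story awful acting", "negative"),
]
model = train_naive_bayes(toy_samples)
print(predict("great story", *model))
print(predict("awful boring movie", *model))
```

The denominator P(x) is omitted because it is the same for every class and therefore does not change which class scores highest.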
Please check the academic paper ‘Fast and accurate sentiment classification using an enhanced Naive Bayes model’ to understand the theory behind finding the maximum likelihood probability of a word belonging to a particular class.
The ViveknSentimentApproach annotator is used for training models, and the ViveknSentimentModel annotator is used for getting predictions from a trained model.
First, we will use a pretrained ViveknSentiment model from the John Snow Labs Models Hub to predict the sentiment of a text. Then, using just 6 labelled sentences, we will train a model and use it to get predictions.
Use a Model from John Snow Labs Models Hub
The ViveknSentiment annotator expects DOCUMENT and TOKEN annotations as input and provides SENTIMENT as output. Thus, the earlier pipeline stages must generate the annotations that will be used as input to our annotator.
Note that a Normalizer is added to the pipeline as the third stage in order to clean the tokens before feeding them to the sentiment detection stage.
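The effect of that cleaning step can be sketched in plain Python. The snippet below assumes a cleanup rule that strips every non-letter character from each token and drops tokens that end up empty. The `normalize_tokens` helper and its regular expression are illustrative assumptions, not the Normalizer's actual code or configuration.

```python
import re


def normalize_tokens(tokens, cleanup_pattern=r"[^A-Za-z]"):
    """Strip characters matching cleanup_pattern and drop emptied tokens."""
    cleaned = (re.sub(cleanup_pattern, "", token) for token in tokens)
    return [token for token in cleaned if token]


# Tokens as a tokenizer might produce them for "Dont waste your time!!!"
raw_tokens = ["Dont", "waste", "your", "time", "!", "!", "!"]
print(normalize_tokens(raw_tokens))
```

With a rule like this, punctuation-only tokens never reach the sentiment stage, so the classifier only sees word-like features.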
Please check the details of the pipeline below:
# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline, Finisher
from sparknlp.annotator import (
    Tokenizer,
    Normalizer,
    ViveknSentimentModel
)
import pyspark.sql.functions as F

# Step 1: Transforms raw texts to `document` annotation
document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

# Step 2: Tokenization
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# Step 3: Normalizer
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normal")

# Step 4: Sentiment Detection
vivekn = (
    ViveknSentimentModel.pretrained("sentiment_vivekn")
    .setInputCols(["document", "normal"])
    .setOutputCol("result_sentiment")
    .setFeatureLimit(4)
)

# Step 5: Finisher
finisher = (
    Finisher()
    .setInputCols(["result_sentiment"])
    .setOutputCols(["final_sentiment"])
)

# Define the pipeline
pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        normalizer,
        vivekn,
        finisher
    ]
)
We will use a dataframe (data) with two sample sentences and get sentiment predictions for them.
# Define dataframe with sample texts
data = spark.createDataFrame([
    ["I recommend this movie"],
    ["Dont waste your time!!!"]
]).toDF("text")

# Fit the pipeline and run it on the sample texts
model = pipeline.fit(data)
result = model.transform(data)

# Display both the sample text and the predicted sentiment
result.select("text", "final_sentiment").show(truncate=False)
LightPipeline is a Spark NLP specific Pipeline class equivalent to the Spark ML Pipeline. The difference is that its execution does not follow Spark principles. Instead, it computes everything locally (but in parallel) to achieve fast results when dealing with small amounts of data. This means that we do not pass a Spark DataFrame as input, but a string or an array of strings to be annotated.
We can show the results in a Pandas DataFrame by running the following code:
import pandas as pd
from sparknlp.base import LightPipeline

text = "This is a terrible movie"

# Wrap the fitted pipeline so that it runs locally on plain strings
light_model = LightPipeline(model)

# fullAnnotate returns one dictionary of annotations per input text
light_result = light_model.fullAnnotate(text)

# Collect the text together with each predicted sentiment
results_tabular = []
for res in light_result[0]["result_sentiment"]:
    results_tabular.append((text, res.result))

pd.DataFrame(results_tabular, columns=["text", "final_sentiment"])
In October 2022, John Snow Labs released the open-source johnsnowlabs library that contains all the company products, open-source and licensed, under one common library. This simplified the workflow, especially for users working with more than one of the libraries (e.g., Spark NLP + Healthcare NLP). This new library is a wrapper around all of John Snow Labs’ libraries and can be installed with
pip install johnsnowlabs
Please check the official documentation for more examples and usage of this library. To run sentiment analysis with one line of code, we can simply:
# Import the NLP module which contains Spark NLP and NLU libraries
from johnsnowlabs import nlp

nlp.load("en.sentiment.vivekn").predict("""This is a terrible movie!""")
The one-liner is based on default models for each NLP task. Depending on your requirements, you may want to use the one-liner for simplicity or customize the pipeline to choose specific models that fit your needs.
NOTE: When using only the johnsnowlabs library, make sure you initialize the Spark session with the configuration you have available. Since some of the libraries are licensed, you may need to set the path to your license file. If you are only using the open-source libraries, you can start the session with spark = nlp.start(nlp=False). The default parameters of the start function include the licensed Healthcare NLP library with nlp=True, but we can set that to False and use all the resources of the open-source libraries such as Spark NLP, Spark NLP Display, and NLU.
Train a Model by ViveknSentimentApproach
In order to show the capacity of the ViveknSentimentApproach annotator in model training, let us train a model with just 6 labelled sentences and then use this trained model to get predictions.
The pipeline below is quite similar to the one that we used with the pretrained model:
# Import the required modules and classes
from sparknlp.annotator import (
    ViveknSentimentApproach
)

document_assembler = (
    DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
)

tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normal")

# Trainable annotator: the labels are read from the `train_sentiment` column
vivekn = (
    ViveknSentimentApproach()
    .setInputCols(["document", "normal"])
    .setOutputCol("result_sentiment")
    .setSentimentCol("train_sentiment")
)

# Finisher exposes setOutputCols (plural), as in the previous pipeline
finisher = (
    Finisher()
    .setInputCols(["result_sentiment"])
    .setOutputCols(["final_sentiment"])
)

pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        normalizer,
        vivekn,
        finisher
    ]
)
We will use the labelled sentences below for model training:
training = spark.createDataFrame([
    ("I really liked this movie!", "positive"),
    ("The cast was horrible", "negative"),
    ("Never going to watch this again or recommend it to anyone", "negative"),
    ("It's a waste of time", "negative"),
    ("I loved the protagonist", "positive"),
    ("The music was really really good", "positive")
]).toDF("text", "train_sentiment")
Now, we fit the pipeline on the training dataset and then use the resulting model to get sentiment predictions for the simple sentences below:
# Train the model on the labelled sentences
pipelineModel = pipeline.fit(training)

data = spark.createDataFrame([
    ["I recommend this movie"],
    ["Dont waste your time!!!"]
]).toDF("text")

# Predict the sentiment of the new sentences
result = pipelineModel.transform(data)
result.select("text", "final_sentiment").show(truncate=False)
For additional information, please check the following references.
- Documentation : ViveknSentiment
- Python Docs : ViveknSentimentModel, ViveknSentimentApproach
- Scala Docs : ViveknSentimentModel, ViveknSentimentApproach
- For extended examples of usage, see the Spark NLP Workshop repository.
- Academic Reference Paper: Fast and accurate sentiment classification using an enhanced Naive Bayes model.
- For LightPipelines, check this post.
Sentiment analysis using the ViveknSentiment annotator of Spark NLP provides a powerful and efficient approach for analyzing and extracting sentiment from textual data. The annotator is based on an enhanced Naive Bayes model: it learns from labelled examples how strongly words are associated with each sentiment class and combines that evidence to determine the overall sentiment of the text.
Despite its simplicity, the ViveknSentiment annotator has been applied to tasks such as social media monitoring, product reviews, and movie reviews. However, like any computational approach, it is not without limitations, such as the need for high-quality labelled training data and the difficulty of handling sarcasm and irony.
It can also be limited in its ability to generalize to new or unseen text, so more training data will generally give better results.