Unleash the Potential of Your Texts with Powerful Keyword Extraction Using Spark NLP and Python
TL;DR: Keyword extraction is the process of automatically extracting the most important words and phrases from a document or text. Using Spark NLP, you can accurately extract keywords from any text.
Keyword extraction is a technique in natural language processing (NLP) that involves automatically identifying the most important words or phrases in a document or corpus. These keywords can be used in a variety of ways, including:
- Document indexing: Keywords extracted from a document can be used to index the document, making it easier to search for and retrieve relevant information.
- Document summarization: Keywords can be used to summarize the main topics or themes of a document, helping users to quickly understand its content without having to read the entire document.
- Content categorization: Keywords can be used to categorize documents based on their content, allowing users to easily find documents on specific topics.
- Content tagging: Keywords can be used to tag documents with relevant labels or metadata, making it easier to organize and manage large collections of documents.
- Search engine optimization: Keywords can be used to optimize web content for search engines, helping to improve its visibility and ranking in search results.
Keyword extraction can be done using a variety of techniques, including statistical methods, machine learning algorithms, and natural language processing tools.
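To make the statistical end of that spectrum concrete, here is a minimal baseline in plain Python: ranking candidate words by raw term frequency after removing stopwords. This is a deliberately naive illustration (the text and stopword list are made up for the example), not how YAKE! or Spark NLP works internally.

```python
import re
from collections import Counter

def frequency_keywords(text, stopwords, top_n=3):
    """Rank candidate keywords by raw term frequency, ignoring stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return [word for word, _ in counts.most_common(top_n)]

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}
sample_text = ("Keyword extraction is a task in natural language processing. "
               "Extraction of keyword candidates is the first step of keyword extraction.")

print(frequency_keywords(sample_text, STOPWORDS))
```

Frequency alone is a weak signal; feature-based systems such as YAKE! combine several statistics (position, casing, context) to do much better.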
Yake! is a novel feature-based system for multilingual keyword extraction that supports texts of different sizes, domains, and languages. Unlike other approaches, Yake! relies on neither dictionaries nor thesauri, and it is not trained on any corpus. Instead, it follows an unsupervised approach built upon features extracted from the text itself, making it applicable to documents written in different languages without the need for further knowledge. This is beneficial for a large number of tasks and situations where access to training corpora is either limited or restricted.
YAKE!'s algorithm rests on statistical features extracted from single documents. Each extracted keyword is given a score greater than 0 (the lower the score, the better the keyword).
In this post, you will learn how to use Spark NLP to perform keyword extraction using the YakeKeywordExtraction annotator.
Let us start with a short Spark NLP introduction and then discuss the details of keyword extraction with some solid results.
Introduction to Spark NLP
Spark NLP is an open-source library maintained by John Snow Labs. It is built on top of Apache Spark and Spark ML and provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment.
Since its first release in July 2017, Spark NLP has grown into a full NLP tool, providing:
- A single unified solution for all your NLP needs
- Transfer learning and implementing the latest and greatest SOTA algorithms and models in NLP research
- The most widely used NLP library in industry (5 years in a row)
- The most scalable, accurate and fastest library in NLP history
Spark NLP comes with 14,500+ pretrained pipelines and models in more than 250 languages. It supports most NLP tasks and provides modules that can be used seamlessly in a cluster.
Spark NLP processes the data using Pipelines, structures that contain all the steps to be run on the input data:
Each step contains an annotator that performs a specific task such as tokenization, normalization, or dependency parsing. Each annotator takes one or more annotations as input and outputs new annotations.
An annotator in Spark NLP is a component that performs a specific NLP task on a text document and adds annotations to it. An annotator takes an input text document and produces an output document with additional metadata, which can be used for further processing or analysis. For example, a named entity recognizer annotator might identify and tag entities such as people, organizations, and locations in a text document, while a sentiment analysis annotator might classify the sentiment of the text as positive, negative, or neutral.
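To make the annotation flow tangible, here is a plain-Python sketch of the kind of record annotators pass between stages. The field names loosely mirror Spark NLP's annotation schema (type, begin/end offsets, result, metadata), but this is a simplified illustration, not the library's actual classes:

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    # Simplified sketch of a Spark NLP annotation row
    annotator_type: str   # e.g. "document", "token", "chunk"
    begin: int            # start character offset in the original text
    end: int              # end character offset (inclusive)
    result: str           # the annotated text span
    metadata: dict = field(default_factory=dict)

# A tokenizer-like stage turns one document annotation into token annotations
doc = Annotation("document", 0, 16, "Spark NLP tokens!")

tokens = []
offset = 0
for word in doc.result.split():
    start = doc.result.index(word, offset)
    tokens.append(Annotation("token", start, start + len(word) - 1, word))
    offset = start + len(word)

print([t.result for t in tokens])  # ['Spark', 'NLP', 'tokens!']
```

Each downstream annotator consumes annotations of the type it declares as input, which is why the pipeline stages below must be wired in the right order.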
To install Spark NLP in Python, simply use your favorite package manager (conda, pip, etc.). For example:
```bash
pip install spark-nlp
pip install pyspark
```
For other installation options for different environments and machines, please check the official documentation.
Then, simply import the library and start a Spark session:
```python
import sparknlp

# Start Spark Session
spark = sparknlp.start()
```
Defining the Spark NLP Pipeline
The YakeKeywordExtraction annotator expects TOKEN as input and provides CHUNK as output. Thus, we need the previous steps to generate the annotations that will be used as input to our annotator.
Spark NLP follows the pipeline approach, and the pipeline will include the necessary stages.
The YAKE! algorithm makes use of the position of sentences and tokens. Therefore, to use the annotator, the text should first be sent through a SentenceDetector and then a Tokenizer.
```python
# Import the required modules and classes
from sparknlp.base import DocumentAssembler, Pipeline
from sparknlp.annotator import (
    SentenceDetector,
    Tokenizer,
    YakeKeywordExtraction
)
import pyspark.sql.functions as F

# Step 1: Transforms raw texts to `document` annotation
document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Step 2: Sentence Detection
sentenceDetector = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

# Step 3: Tokenization
token = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token") \
    .setContextChars(["(", ")", "?", "!", ".", ","])

# Step 4: Keyword Extraction
keywords = YakeKeywordExtraction() \
    .setInputCols("token") \
    .setOutputCol("keywords")

# Define the pipeline
yake_pipeline = Pipeline(stages=[document, sentenceDetector, token, keywords])

# Create an empty dataframe
empty_df = spark.createDataFrame([['']]).toDF("text")

# Fit the empty dataframe to get the pipeline model
yake_Model = yake_pipeline.fit(empty_df)
```
LightPipeline is a Spark NLP-specific Pipeline class, equivalent to the Spark ML Pipeline. The difference is that its execution does not adhere to Spark principles; instead, it computes everything locally (but in parallel) in order to achieve fast results when dealing with small amounts of data. This means we do not input a Spark DataFrame, but a string or an array of strings instead, to be annotated.
```python
from sparknlp.base import LightPipeline

light_model = LightPipeline(yake_Model)

text = '''
google is acquiring data science community kaggle. Sources tell us that google is acquiring kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that google is hosting its Cloud Next conference in san francisco this week, the official announcement could come as early as tomorrow. Reached by phone, kaggle co-founder ceo anthony goldbloom declined to deny that the acquisition is happening. google itself declined 'to comment on rumors'. kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With kaggle, google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). kaggle has a bit of a history with google, too, but that's pretty recent. Earlier this month, google and kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the google Cloud platform, too. Our understanding is that google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, kaggle did build some interesting tools for hosting its competition and 'kernels', too. On kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, kaggle also runs a job board, too. It's unclear what google will do with that part of the service. According to Crunchbase, kaggle raised $12.5 million (though PitchBook says it's $12.75) since its launch in 2010. Investors in kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, google chief economist Hal Varian, Khosla Ventures and Yuri Milner
'''

light_result = light_model.fullAnnotate(text)
```
We can show the results in a Pandas DataFrame by running the following code:
```python
import pandas as pd

# fullAnnotate returns a list of results, one per input string
keys_df = pd.DataFrame(
    [(k.result, k.begin, k.end, k.metadata['score'], k.metadata['sentence'])
     for k in light_result[0]['keywords']],
    columns=['keywords', 'begin', 'end', 'score', 'sentence'])
keys_df['score'] = keys_df['score'].astype(float)

# Ordered by relevance
keys_df.sort_values(['sentence', 'score']).head(100)
```
We will not discuss the details here, but we have a notebook covering the keyword extraction process in depth. You can also check the YakeKeywordExtraction page for the annotator's parameters, which control the details of the extraction. The parameters will help you to:
- Define the minimum and maximum number of words in a keyword,
- Set the total number of keywords to be extracted from the text,
- Define a list of stopwords to be ignored during extraction,
- Define a threshold to filter the extracted keywords.
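The effect of these filters can be sketched in plain Python. The candidate list and scores below are invented for illustration; in Spark NLP, the same knobs are set on the annotator itself through setters such as setMinNGrams, setMaxNGrams, setNKeywords, setStopWords, and setThreshold (check the annotator's API docs for the exact names).

```python
def filter_keywords(candidates, min_ngrams=1, max_ngrams=3,
                    n_keywords=3, stopwords=(), threshold=1.0):
    """Mimic the effect of the annotator's filters on (keyword, score) pairs."""
    kept = []
    for keyword, score in candidates:
        words = keyword.split()
        if not (min_ngrams <= len(words) <= max_ngrams):
            continue  # outside the allowed phrase length
        if any(w in stopwords for w in words):
            continue  # contains a stopword
        if score > threshold:
            continue  # not relevant enough (lower score = better)
        kept.append((keyword, score))
    kept.sort(key=lambda pair: pair[1])  # best (lowest score) first
    return kept[:n_keywords]

candidates = [("machine learning", 0.12), ("the service", 0.80),
              ("google", 0.25), ("data science community kaggle", 0.05)]

print(filter_keywords(candidates, max_ngrams=2, stopwords={"the"}, threshold=0.5))
```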
Highlighting Keywords in a Text
In addition to getting the keywords as a dataframe, it is also possible to highlight the extracted keywords in the text.
In this example, a dataset of 7537 texts was used: samples from PubMed, a free resource supporting the search and retrieval of biomedical and life sciences literature.
```python
!wget -q https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/resources/en/pubmed/pubmed_sample_text_small.csv

df = spark.read \
    .option("header", "true") \
    .csv("pubmed_sample_text_small.csv")

df.show(truncate=50)
```
In this case, instead of using the LightPipeline, we use transform to get predictions from the dataframe of PubMed documents (df).
```python
import re

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from IPython.display import display, HTML

# Fit and transform the dataframe to get the predictions
result = yake_pipeline.fit(df).transform(df)
result = result.withColumn('unique_keywords', F.array_distinct("keywords.result"))

def highlight(text, keywords):
    for k in keywords:
        text = re.sub(r'(\b%s\b)' % k,
                      r'<span style="background-color: yellow;">\1</span>',
                      text, flags=re.IGNORECASE)
    return text

highlight_udf = udf(highlight, StringType())

result = result.withColumn("highlighted_keywords",
                           highlight_udf('text', 'unique_keywords'))

for r in result.select("highlighted_keywords").limit(20).collect():
    display(HTML(r.highlighted_keywords))
    print("\n\n")
```
If you prefer a visual result instead of a dataframe, this produces a nice rendering of the texts with their extracted keywords highlighted.
For additional information, please consult the following references.
- Documentation: YakeKeywordExtraction
- Python Docs : YakeKeywordExtraction
- Scala Docs: YakeKeywordExtraction
- For extended examples of usage, see the Spark NLP Workshop repository.
- Reference Paper: YAKE! Keyword extraction from single documents using multiple local features
Keyword extraction is an NLP technique that involves identifying and extracting the most important words or phrases from a piece of text. The extracted keywords can be used to summarize the content of the text, to identify the main topics and themes discussed in the text, or to facilitate information retrieval.
YakeKeywordExtraction is a keyword extraction technique implemented in Spark NLP. Yake is a novel feature-based system for multilingual keyword extraction that supports texts of different sizes, domains, and languages.