was successfully added to your cart.

    Smarter Text Matching for Real-World NLP in Healthcare Use Cases

    The latest version of TextMatcher in Healthcare NLP introduces powerful linguistic enhancements such as lemmatization, stemming, stopwords removal, and token shuffling. These new features provide a flexible yet efficient way to perform high-quality phrase matching in unstructured text.

    Text matching is a critical component in many NLP pipelines — from clinical concept recognition to entity extraction in unstructured text. But traditional methods often fall short when dealing with the fluid and inconsistent nature of language. Inflected words, reordered tokens, irrelevant stopwords, or domain-specific noise can easily cause exact-match systems to overlook valuable information.

    With the latest update to Healthcare NLP, the TextMatcher module introduces a smarter, more linguistically aware text matching engine. It brings support for lemmatization, stemming, token shuffling, customizable stopword handling, and fine-grained control over matching behavior — all designed to significantly improve both recall and precision in complex or noisy text.

    In this post, we’ll walk through each of the new capabilities, show how they work in practice, and explain how they help you build more resilient and context-aware NLP systems.

    Named Entity Recognition (NER) Methods and Tools - Healthcare NLP

    Healthcare NLP offers three main approaches for extracting entities from clinical and medical text: Machine Learning-based, Rule-Based, and LLM-based methods. Each approach has its own advantages depending on the use case and requirements.
    Additionally, tools like ContextualEntityRuler and ContextualEntityFilterer provide fine-tuning and output filtering capabilities to enhance rule-based pipelines. This comprehensive ecosystem enables both high accuracy and flexibility.
    If you’d like to explore the practical use of NER models in Healthcare NLP, check out the following resources:

    The table below summarizes these methods, models, and tools with brief descriptions, along with their strengths and limitations.

    What is TextMatcher in Healthcare NLP?

    TextMatcher is one of the core annotators in Healthcare NLP, designed to match exact phrases within unstructured text. It works by comparing tokenized input against a predefined list of phrases — typically loaded from an external file — and returns matching segments as CHUNK annotations.

    Unlike fuzzy or model-based approaches, TextMatcher operates with high precision by relying on token-level exact matching, making it ideal for use cases where matching consistency and interpretability are essential — such as identifying known drug names, procedure codes, or policy phrases.

    Internally, TextMatcher uses a SearchTrie-based algorithm, which enables fast and efficient lookup of multi-token phrases across large documents and vocabularies.

    To run, TextMatcher requires the following inputs:

    • DOCUMENT: the full text to process
    • TOKEN: the tokenized version of the input

    Its output:

    • CHUNK: the matched phrases found within the text

    This makes TextMatcher both lightweight and highly scalable — perfect for high-volume information extraction pipelines that require exact term recognition without the overhead of deep learning.

    In the next section, we’ll see how TextMatcher builds upon this foundation to support flexible, linguistically-informed matching for more real-world robustness.

    What’s New in TextMatcher?

    TextMatcher has evolved beyond simple exact matching to support a range of linguistically informed and customizable options. These new capabilities are designed to improve both recall and precision when dealing with messy, variable, or naturally flexible language

    Here’s an overview of what’s new in the latest version:

    Lemmatization & Stemming

    Enable normalization of word forms so that variations like “running” and “run”, or “stressors” and “stressor” can be matched correctly.

    • setEnableLemmatizer(True): reduces words to their dictionary base form
    • setEnableStemmer(True): trims suffixes to match root forms

    Example:

    • "running" → matches "run"
    • "studies" → matches "study"

    Token Shuffling

    Match phrases even when the word order changes slightly.

    • setShuffleEntitySubTokens(True): enables all permutations of token order for each entity phrase

    Example:
    Entity: "sleep difficulty"
    Matches:"difficulty sleeping", "sleep difficulty"

    This is especially useful when word order varies across different sentence constructions.

    Stopword Handling

    Reduce false negatives caused by unimportant words such as “and”, “of”, “about”.

    • setCleanStopWords(True): removes common stopwords from both source text and entity list
    • setStopWords([...]): provide a custom stopword list
    • setSafeKeywords([...]): preserve important domain-specific stopWords
    • setCleanKeywords([...]): remove custom noise terms (e.g., "note", "type")
    • setExcludePunctuation(True): drop punctuation during matching

    Example:

    • "evaluation of psychiatric state" → matches "evaluation psychiatric state"

    Augmentation Controls

    Control how much automatic variation the matcher performs on source text and entities.

    • setSkipMatcherAugmentation(True): disables augmentation of entity phrases
    • setSkipSourceTextAugmentation(True): disables augmentation on input text

    Use these settings to optimize for performance when needed.

    Match Output Control

    Specify the format of the returned chunks.

    • setReturnChunks("original"): returns the phrase as it appears in the input
    • setReturnChunks("matched"): returns the normalized matched form (after stemming/lemmatization)

    Note: Regardless of this setting, the begin and end character offsets always refer to the original text.

    Enhanced Text Matching in Action: A Comparative Look

    In this section, we demonstrate how enabling these options significantly improves text matching by comparing a baseline matcher with a fully enhanced configuration.

    Example Input:

    text = """
    Patient was able to talk briefly about recent life stressors during evaluation
    of psychiatric state. She reports difficulty sleeping and ongoing anxiety.
    Denies suicidal ideation.
    """

    And a list of phrases we want to detect:

    test_phrases = """
    stressor
    suicidal deny
    sleep difficulty
    evaluation psychiatric state
    anxiety
    """
    
    with open("test-phrases.txt", "w") as file:
        file.write(test_phrases)

    We define two versions of TextMatcher: one with default (basic) settings, and one with all enhancements enabled.

    # Basic matcher: exact token matching only
    text_matcher_basic = TextMatcherInternal()\
        .setInputCols(["sentence","token"])\
        .setOutputCol("matched_text_basic")\
        .setEntities("./test-phrases.txt")
    
    # Enhanced matcher with all advanced options enabled
    text_matcher_enhanced = TextMatcherInternal()\
        .setInputCols(["sentence","token"])\
        .setOutputCol("matched_text_enhanced")\
        .setEntities("./test-phrases.txt")\
        .setEnableLemmatizer(True)\
        .setEnableStemmer(True)\
        .setCleanStopWords(True)\
        .setBuildFromTokens(False)\
        .setReturnChunks("matched")\
        .setShuffleEntitySubTokens(True)

    Results: Basic Matcher

    +------+-----+---+-------+
    |entity|begin|end|result |
    +------+-----+---+-------+
    |entity|146  |152|anxiety|
    +------+-----+---+-------+

    Results: Enhanced Matcher

    +------+-----+---+----------------------------+-------------------------------+
    |entity|begin|end|matched_text                |original                       |
    +------+-----+---+----------------------------+-------------------------------+
    |entity|69   |99 |evaluation psychiatric state|evaluation of psychiatric state|
    |entity|52   |60 |stressor                    |stressors                      |
    |entity|146  |152|anxiety                     |anxiety                        |
    |entity|114  |132|difficulty sleep            |difficulty sleeping            |
    |entity|155  |169|deni suicidal               |Denies suicidal                |
    +------+-----+---+----------------------------+-------------------------------+

    With the advanced settings enabled:

    • Stemming/Lemmatization allows matching "stressors" to "stressor" and "sleeping" to "sleep".
    • Stopword Cleaning enables "evaluation psychiatric state" to match "evaluation of psychiatric state".
    • Token Shuffling ensures "suicidal deny" matches "Denies suicidal" even if the token order is reversed.
    • The matched_text column shows the raw matched text from the input, making it easy to trace.

    Combining TextMatcher with MedicalNerModel

    While pretrained clinical NER models are powerful for extracting standardized biomedical entities, they may not always capture context-specific or custom phrases. That’s where TextMatcher complements the NER model — by allowing you to inject custom vocabulary and domain-specific expressions into your extraction pipeline.

    In this example, we use both models in the same pipeline to demonstrate their synergy.

    Example Input:

    text = """
    HYPERBILIRUBINEMIA: At risk for hyperbilirubinemia due to prematurity.
    Mother is A+ and infant delivered for decreasing fetal movement and
    preeclampsia. Long fingers and toes were detected.Cardiac evaluation revealed
    evidence of bidirectional shunting, suggestive of transitional circulatory
    dynamics. Additionally, a persisting patent ductus arteriosus (PDA) was noted
    during cardiac-related assessments, which is quite common in this demographic.
    """

    Pipeline Components:

    # Pretrained clinical NER model for phenotype/gene recognition
    clinical_ner = MedicalNerModel.pretrained(
      "ner_human_phenotype_gene_clinical_langtest",
        "en", "clinical/models"
    ).setInputCols(["sentence", "token", "embeddings"]) \
     .setOutputCol("ner")
    
    # Custom text matcher with flexible matching logic
    text_matcher = TextMatcherInternal()\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("matcher_chunk")\
        .setEntities("./test-phrases.csv")\
        .setDelimiter("#")\
        .setEnableLemmatizer(True)\
        .setEnableStemmer(False)\
        .setCleanStopWords(True)\
        .setBuildFromTokens(True)\
        .setReturnChunks("original")\
        .setExcludePunctuation(True)\
        .setShuffleEntitySubTokens(False)

    NER Model Output:

    +------+-----+---+----------------------+
    |entity|begin|end|result |
    +------+-----+---+----------------------+
    |HP |33 |50 |hyperbilirubinemia |
    |HP |127 |134|movement |
    |HP |230 |251|bidirectional shunting|
    |HP |337 |353|ductus arteriosus |
    +------+-----+---+----------------------+

    TextMatcher Output:

    +------+-----+---+-------------------------+
    |entity|begin|end|result                   |
    +------+-----+---+-------------------------+
    |HPO   |33   |50 |hyperbilirubinemia       |
    |HPO   |140  |151|preeclampsia             |
    |HPO   |110  |134|decreasing fetal movement|
    |HPO   |154  |165|Long fingers             |
    |HPO   |230  |251|bidirectional shunting   |
    +------+-----+---+-------------------------+

    Additionally, the matcher identified extra mentions such as preeclampsia, Long fingers, and decreasing fetal movement, which were not detected by the NER model.

    We can combine the outputs of the pretrained NER model and the custom text matcher using ChunkMergeModel.

    # Merge chunks from NER and TextMatcher into a single column
    chunk_merger = ChunkMergeModel()\
        .setInputCols("ner_chunk", "matcher_chunk")\
        .setOutputCol("hpo_terms")

    Result:

    +------+-----+---+-------------------------+
    |entity|begin|end|result                   |
    +------+-----+---+-------------------------+
    |HPO   |33   |50 |hyperbilirubinemia       |
    |HPO   |110  |134|decreasing fetal movement|
    |HPO   |140  |151|preeclampsia             |
    |HPO   |154  |165|Long fingers             |
    |HPO   |230  |251|bidirectional shunting   |
    |HPO   |337  |353|ductus arteriosus        |
    +------+-----+---+-------------------------+

    Example Usage

    To see TextMatcher in action, including how to configure it for different scenarios using stemming, lemmatization, stopword cleaning, and token shuffling, check out the official Healthcare NLP notebook below:

    TextMatcher Example — GitHub Notebook

    This notebook walks you through:

    • How to load and prepare entity phrases
    • How to configure matcher parameters
    • Practical clinical text matching examples

    It’s a great starting point for building your own matcher pipeline or experimenting with different matching behaviors on real-world data.

    Conclusion

    TextMatcher bridges the gap between fast, rule-based phrase recognition and the linguistic flexibility often needed in real-world NLP. By combining tools like lemmatization, stemming, stopword handling, token shuffling, and customizable augmentation, it empowers practitioners to build more accurate and resilient matchers without sacrificing performance.

    Together, these features make TextMatcher a powerful tool for any NLP task that demands flexible, accurate, and domain-adapted text matching

    Whether you’re working on information extraction, rule-based tagging, or clinical decision support, this upgrade brings matching one step closer to true language understanding.

    How useful was this post?

    Read our newest articles on AI in healthcare.

    Read more
    Our additional expert:
    Junior Scala Developer at John Snow Labs

    Reliable and verified information compiled by our editorial and professional team. John Snow Labs' Editorial Policy.

    John Snow Labs Launches Martlet.ai, Setting New Standards for Risk Adjustment with Healthcare Large Language Models

    The first of several new spinoff companies, Martlet.ai reimagines how payers and providers approach HCC coding with an on-premise, secure, AI-based solution...
    preloader