An easier API for creating custom #NLP graphs with Spark NLP for Healthcare 3.5.2

18.05.2022

Muhammet Santas

Master’s Degree St. in Artificial Intelligence

TFGraphBuilder annotator to create graphs for training NER, Assertion, Relation Extraction, and Generic Classifier models
Default TF graphs added for AssertionDLApproach to let users train models without custom graphs
New functionalities in ContextualParserApproach
Printing the list of clinical pretrained models and pipelines with a one-liner
New clinical models
Clinical NER model (ner_biomedical_bc2gm)
Clinical ChunkMapper models (abbreviation_mapper, rxnorm_ndc_mapper, drug_brandname_ndc_mapper, rxnorm_action_treatment_mapper)

`TFGraphBuilder` annotator to create graphs for training NER, Assertion, Relation Extraction, and Generic Classifier models

We have a new annotator that creates graphs within the model training pipeline. TFGraphBuilder inspects the data and creates the proper graph if a suitable version of TensorFlow (<= 2.7) is available. The graph is stored in the defined folder and loaded by the approach.

You can use this builder with MedicalNerApproach, RelationExtractionApproach, AssertionDLApproach, and GenericClassifierApproach.
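
Since the builder depends on a suitable TensorFlow version, it can help to check the installed version before training. The following is a minimal sketch using only the standard library; the helper names `tf_version_supported` and `installed_tf_version` are hypothetical, and it assumes TensorFlow is installed under the distribution name `tensorflow`.

```python
from importlib import metadata
from typing import Optional

MAX_TF_VERSION = (2, 7)  # upper bound stated in the text above


def tf_version_supported(version: str) -> bool:
    """Return True if a version string such as '2.7.0' is <= 2.7 (major.minor)."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) <= MAX_TF_VERSION


def installed_tf_version() -> Optional[str]:
    """Return the installed TensorFlow version, or None if it is not installed."""
    try:
        # Assumption: the package is installed under the name "tensorflow"
        return metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        return None


version = installed_tf_version()
if version is None:
    print("TensorFlow is not installed.")
elif tf_version_supported(version):
    print(f"TensorFlow {version} is within the stated range (<= 2.7).")
else:
    print(f"TensorFlow {version} is newer than 2.7.")
```

Comparing `(major, minor)` tuples rather than raw strings avoids ranking `2.10` below `2.7`.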

Example:

        graph_folder_path = "./medical_graphs"

        # Build the TF graph from the training data and store it in graph_folder_path
        med_ner_graph_builder = TFGraphBuilder()\
            .setModelName("ner_dl")\
            .setInputCols(["sentence", "token", "embeddings"])\
            .setLabelColumn("label")\
            .setGraphFile("auto")\
            .setHiddenUnitsNumber(20)\
            .setGraphFolder(graph_folder_path)

        # The approach loads the graph from the same folder
        med_ner = MedicalNerApproach()\
            ...
            .setGraphFolder(graph_folder_path)

        medner_pipeline = Pipeline(stages=[
            ...,
            med_ner_graph_builder,
            med_ner
        ])

For more examples, please check TFGraph Builder Notebook.

Default TF graphs added for `AssertionDLApproach` to let users train models without custom graphs

We added default TF graphs for the AssertionDLApproach to let users train assertion models without specifying any custom TF graph.

Default Graph Features:

Feature Sizes: 100, 200, 768
Number of Classes: 2, 4, 8
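
As a quick illustration, the check below tests whether a training setup matches the listed values. It is only a sketch: `default_graph_listed` is a hypothetical helper, and it assumes that every listed feature size is available with every listed class count and that the values must match exactly, neither of which is confirmed here.

```python
DEFAULT_FEATURE_SIZES = {100, 200, 768}  # feature sizes listed above
DEFAULT_CLASS_COUNTS = {2, 4, 8}         # class counts listed above


def default_graph_listed(feature_size: int, n_classes: int) -> bool:
    """Return True if both values appear in the published default-graph lists.

    Assumption: all size/class combinations exist and must match exactly.
    """
    return feature_size in DEFAULT_FEATURE_SIZES and n_classes in DEFAULT_CLASS_COUNTS


print(default_graph_listed(200, 4))  # both values are listed
print(default_graph_listed(300, 4))  # 300 is not a listed feature size
```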

New Functionalities in `ContextualParserApproach`

Added the setOptionalContextRules parameter, which outputs regex matches regardless of whether the context (prefix or suffix configuration) matches.
The setJsonPath parameter now also accepts the configuration as a JSON string.

Confidence Value Scenarios:

When only the regex matches, the confidence value will be 0.5.
When the regex and a prefix match together, the confidence value will be greater than 0.5, depending on the distance between the target token and the prefix.
When the regex and a suffix match together, the confidence value will be greater than 0.5, depending on the distance between the target token and the suffix.
When the regex, prefix, and suffix all match, the confidence value will be greater than in the other scenarios.
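
To make the ordering of these scenarios concrete, here is a toy scoring function. It is not the library's formula: the bonus values, the `toy_confidence` name, and the way distance is scaled are all invented for illustration. It only reproduces the properties described above, and it assumes a closer context word earns a larger bonus.

```python
from typing import Optional

BASE_CONFIDENCE = 0.5  # regex-only value stated above


def toy_confidence(prefix_distance: Optional[float] = None,
                   suffix_distance: Optional[float] = None,
                   context_length: float = 100) -> float:
    """Hypothetical score: 0.5 for a regex-only match plus a bonus per context match.

    A distance of None means that context word was not found.
    """
    def bonus(distance: Optional[float]) -> float:
        if distance is None:
            return 0.0
        # Closeness is 1.0 when the context word is adjacent, 0.0 at the window edge
        closeness = max(0.0, 1.0 - distance / context_length)
        return 0.15 + 0.05 * closeness  # invented constants, always positive

    return BASE_CONFIDENCE + bonus(prefix_distance) + bonus(suffix_distance)


print(toy_confidence())                      # regex only
print(toy_confidence(prefix_distance=5))     # regex + nearby prefix
print(toy_confidence(prefix_distance=50))    # regex + distant prefix
print(toy_confidence(5, 5))                  # regex + prefix + suffix
```

With these invented constants, a single context match scores at most 0.70 and a double match at least 0.80, so the last scenario always ranks highest.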

Example:

        import json

        jsonString = {
            "entity": "CarId",
            "ruleScope": "sentence",
            "completeMatchRegex": "false",
            "regex": "\\d+",
            "prefix": ["red"],
            "contextLength": 100
        }
        
        with open("jsonString.json", "w") as f:
            json.dump(jsonString, f)
        
        contextual_parser = ContextualParserApproach()\
            .setInputCols(["sentence", "token"])\
            .setOutputCol("entity")\
            .setJsonPath("jsonString.json")\
            .setCaseSensitive(True)\
            .setOptionalContextRules(True)
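
Because setJsonPath also accepts a JSON string, the intermediate file can be skipped. The sketch below serializes the same rule with `json.dumps`; the `build_parser` helper is hypothetical, it imports the annotator lazily so the first part runs without a licensed install, and the `sparknlp_jsl.annotator` import path is an assumption.

```python
import json

# Same rule as in the example above, kept in memory instead of written to disk
rule = {
    "entity": "CarId",
    "ruleScope": "sentence",
    "completeMatchRegex": "false",
    "regex": "\\d+",
    "prefix": ["red"],
    "contextLength": 100,
}

rule_json = json.dumps(rule)  # serialize the rule to a JSON string


def build_parser(json_config: str):
    """Create a ContextualParserApproach from a JSON string (needs a licensed install)."""
    # Assumed import path; imported here so the module loads without the library
    from sparknlp_jsl.annotator import ContextualParserApproach

    return (
        ContextualParserApproach()
        .setInputCols(["sentence", "token"])
        .setOutputCol("entity")
        .setJsonPath(json_config)  # JSON string instead of a file path
        .setCaseSensitive(True)
        .setOptionalContextRules(True)
    )
```

In a licensed environment, `build_parser(rule_json)` would take the place of the file-based configuration.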

Printing the List of Clinical Pretrained Models and Pipelines with a One-Liner

We can now list the clinical model names available for a specific annotator, as well as the names of the clinical pretrained pipelines for a given language.

Clinical Pipeline Names:

Example:

        from sparknlp_jsl.pretrained import InternalResourceDownloader

        InternalResourceDownloader.showPrivatePipelines("en")

Results:

        
        +--------------------------------------------------------+------+---------+
        | Pipeline                                               | lang | version |
        +--------------------------------------------------------+------+---------+
        | clinical_analysis                                      |  en  | 2.4.0   |
        | clinical_ner_assertion                                 |  en  | 2.4.0   |
        | clinical_deidentification                              |  en  | 2.4.0   |
        | clinical_analysis                                      |  en  | 2.4.0   |
        | explain_clinical_doc_ade                               |  en  | 2.7.3   |
        | icd10cm_snomed_mapping                                 |  en  | 2.7.5   |
        | recognize_entities_posology                            |  en  | 3.0.0   |
        | explain_clinical_doc_carp                              |  en  | 3.0.0   |
        | recognize_entities_posology                            |  en  | 3.0.0   |
        | explain_clinical_doc_ade                               |  en  | 3.0.0   |
        | explain_clinical_doc_era                               |  en  | 3.0.0   |
        | icd10cm_snomed_mapping                                 |  en  | 3.0.2   |
        | snomed_icd10cm_mapping                                 |  en  | 3.0.2   |
        | icd10cm_umls_mapping                                   |  en  | 3.0.2   |
        | snomed_umls_mapping                                    |  en  | 3.0.2   |
        | …                                                      | …    | …       |
        +--------------------------------------------------------+------+---------+
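
Some pipelines appear more than once in this list with different versions. If the printed table is captured as text, a small parser can keep the highest version per pipeline. This is a sketch under that assumption: `latest_versions` and `version_key` are hypothetical helpers, and the sample rows are copied from the output above.

```python
from typing import Dict, Tuple

# A few rows copied from the output above
TABLE_TEXT = """
+--------------------------------------------------------+------+---------+
| Pipeline                                               | lang | version |
+--------------------------------------------------------+------+---------+
| explain_clinical_doc_ade                               |  en  | 2.7.3   |
| icd10cm_snomed_mapping                                 |  en  | 2.7.5   |
| explain_clinical_doc_ade                               |  en  | 3.0.0   |
| icd10cm_snomed_mapping                                 |  en  | 3.0.2   |
| snomed_umls_mapping                                    |  en  | 3.0.2   |
+--------------------------------------------------------+------+---------+
"""


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '3.0.2' into (3, 0, 2) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def latest_versions(table_text: str) -> Dict[str, str]:
    """Map each pipeline name to the highest version listed for it."""
    latest: Dict[str, str] = {}
    for line in table_text.splitlines():
        if not line.strip().startswith("|"):
            continue  # skip borders and blank lines
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) != 3 or cells[0] == "Pipeline":
            continue  # skip the header and malformed rows
        name, _lang, version = cells
        if not version[:1].isdigit():
            continue  # skip a truncation row such as "…"
        if name not in latest or version_key(version) > version_key(latest[name]):
            latest[name] = version
    return latest


print(latest_versions(TABLE_TEXT))
```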

New `ner_biomedical_bc2gm` NER Model

This model has been trained to extract genes/proteins from medical text.

See Model Card for more details.

Example:

 
        ...
        ner = MedicalNerModel.pretrained("ner_biomedical_bc2gm", "en", "clinical/models")\
            .setInputCols(["sentence", "token", "embeddings"]) \
            .setOutputCol("ner")
        ...

        text = spark.createDataFrame([["Immunohistochemical staining was positive for S-100 in all 9 cases stained, positive for HMB-45 in 9 (90%) of 10, and negative for cytokeratin in all 9 cases in which myxoid melanoma remained in the block after previous sections."]]).toDF("text")

        result = model.transform(text)

Results:

         
        +-----------+------------+
        |chunk      |ner_label   |
        +-----------+------------+
        |S-100      |GENE_PROTEIN|
        |HMB-45     |GENE_PROTEIN|
        |cytokeratin|GENE_PROTEIN|
        +-----------+------------+

New Clinical `ChunkMapper` Models

We have 4 new ChunkMapper models and a new Chunk Mapping Notebook with examples of their use.

drug_brandname_ndc_mapper: This model maps drug brand names to corresponding National Drug Codes (NDC). Product NDCs for each strength are returned in results and metadata.

See Model Card for more details.

Example:

         
        document_assembler = DocumentAssembler()\
            .setInputCol("text")\
            .setOutputCol("chunk")

        chunkerMapper = ChunkMapperModel.pretrained("drug_brandname_ndc_mapper", "en", "clinical/models")\
            .setInputCols(["chunk"])\
            .setOutputCol("ndc")\
            .setRel("Strength_NDC")

        model = PipelineModel(stages=[document_assembler,
                                        chunkerMapper])  

        light_model = LightPipeline(model)
        res = light_model.fullAnnotate(["zytiga", "ZYVOX", "ZYTIGA"])

Results:

         
        +-------------+--------------------------+-----------------------------------------------------------+
        | Brandname   | Strength_NDC             | Other_NDCs                                                |
        +-------------+--------------------------+-----------------------------------------------------------+
        | zytiga      | 500 mg/1 | 57894-195     | ['250 mg/1 | 57894-150']                                  |
        | ZYVOX       | 600 mg/300mL | 0009-4992 | ['600 mg/300mL | 66298-7807', '600 mg/300mL | 0009-7807'] |
        | ZYTIGA      | 500 mg/1 | 57894-195     | ['250 mg/1 | 57894-150']                                  |
        +-------------+--------------------------+-----------------------------------------------------------+
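
Each value in this table combines a strength and an NDC separated by a `|` character. The sketch below splits such a value into its two parts; `split_strength_ndc` is a hypothetical helper, and it assumes the returned strings follow the format shown in the table.

```python
from typing import List, Tuple


def split_strength_ndc(value: str) -> Tuple[str, str]:
    """Split a 'strength | NDC' string into a (strength, ndc) pair."""
    strength, ndc = (part.strip() for part in value.split("|", 1))
    return strength, ndc


# Values copied from the ZYVOX row of the results table
main_result = "600 mg/300mL | 0009-4992"
other_results: List[str] = ["600 mg/300mL | 66298-7807", "600 mg/300mL | 0009-7807"]

strength, ndc = split_strength_ndc(main_result)
other_ndcs = [split_strength_ndc(item)[1] for item in other_results]
print(strength, ndc, other_ndcs)
```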

abbreviation_mapper: This model maps abbreviations and acronyms of medical regulatory activities to their definitions.

See Model Card for details.

Example:

         
        input = ["""Gravid with estimated fetal weight of 6-6/12 pounds.
        LABORATORY DATA: Laboratory tests include a CBC which is normal. 
        HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]
      
        >> output:
        +------------+----------------------------+
        |Abbreviation|Definition                  |
        +------------+----------------------------+
        |CBC         |complete blood count        |
        |HIV         |human immunodeficiency virus|
        +------------+----------------------------+

rxnorm_action_treatment_mapper: This model maps RxNorm and RxNorm Extension codes to their corresponding action and treatment. Action refers to the function of the drug in various body systems; treatment refers to the disease the drug is used to treat.

See Model Card for details.

Example:

  
        input = ['Sinequan 150 MG', 'Zonalon 50 mg']
          
        >> output:
        +---------------+------------+---------------+
        |chunk          |rxnorm_code |Action         |
        +---------------+------------+---------------+
        |Sinequan 150 MG|1000067     |Antidepressant |
        |Zonalon 50 mg  |103971      |Analgesic      |
        +---------------+------------+---------------+

rxnorm_ndc_mapper: This pretrained model maps RxNorm and RxNorm Extension codes to their corresponding National Drug Codes (NDC).

See Model Card for details.

Example:

        input = ['doxepin hydrochloride 50 MG/ML', 'macadamia nut 100 MG/ML']
          
        >> output:
        +------------------------------+------------+------------+
        |chunk                         |rxnorm_code |Product NDC |
        +------------------------------+------------+------------+
        |doxepin hydrochloride 50 MG/ML|1000091     |00378-8117  |
        |macadamia nut 100 MG/ML       |212433      |00064-2120  |
        +------------------------------+------------+------------+


Muhammet Santas has a Master’s Degree St. in Artificial Intelligence and works as a Data Scientist at John Snow Labs as part of the Healthcare NLP Team.
