Webinars

Webinars presented by John Snow Labs
Watch live:
September 7, 2022 @ 2:00 PM ET

Register now

Automated Text Generation & Data-Augmentation for Medicine, Finance, Law, and E-Commerce

This webinar teaches you how to leverage the human-level text generation capabilities of Large Transformer models to increase the accuracy of most NLP classifiers for Medicine, Finance, and Legal datasets with the Spark NLP library.

We will also explore the next-generation capabilities for E-Commerce and creative writing to enable the creation of automated marketing text.

Additionally, you will gain an intuitive understanding of how the generation parameters Temperature, Top-K sampling, and Top-P (nucleus) sampling influence drawing from word distributions and, in turn, the text generated by Transformer models. Everything is automated and scales effortlessly to industry-scale GPU or CPU clusters with the underlying Spark engine.
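As a rough illustration of the three generation parameters named above, here is a minimal pure-Python sketch. It is not the Spark NLP API; the vocabulary, logits, and function names are made up for the example, and real models operate on far larger vocabularies.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in ranked:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng=random):
    """Draw one token index from a {index: probability} mapping."""
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token scores for a five-word vocabulary.
vocab = ["patient", "doctor", "invoice", "contract", "banana"]
logits = [2.0, 1.5, 0.5, 0.2, -1.0]

probs = softmax_with_temperature(logits, temperature=0.7)
top_k = top_k_filter(probs, k=2)    # only the two likeliest words survive
top_p = top_p_filter(probs, p=0.9)  # the nucleus covering 90% of the mass
print(vocab[sample(top_k)], vocab[sample(top_p)])
```

Lowering the temperature concentrates probability on the likeliest words, while Top-K and Top-P cut off the unlikely tail before sampling, so all three trade diversity against predictability.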

Christian Kasim Loan

Christian Kasim Loan is a Lead Data Scientist and Scala expert at John Snow Labs and a Computer Scientist with over a decade of experience in software. He has worked on various projects in Big Data, Data Science, and Blockchain using modern technologies such as Kubernetes, Docker, Spark, Kafka, Hadoop, Ethereum, and almost 20 programming languages to create modern cloud-agnostic AI solutions, decentralized applications, and analytical dashboards.

He has deep knowledge of Time-Series Graphs from his previous research in scalable and accurate traffic flow prediction and working on various Spatio-Temporal problems embedded in graphs at a Daimler lab.

Before his graph research, he worked on scalable meta machine learning, visual emotion extraction, and chatbots for various use cases at the Distributed Artificial Intelligence lab (DAI) in Berlin.

His most recent work includes the NLU library, which democratizes 5000+ state-of-the-art NLP models in 200+ languages in just 1 line of code for dozens of domains, with built-in visualizations and all scalable natively in Spark Clusters by its underlying Spark NLP distribution engine.

Recorded on:
July 27, 2022 @ 2:00 PM ET

Watch recording

Text classification and named entity recognition with BertForTokenClassification & BertForSequenceClassification

Recognizing entities is a fundamental step towards understanding unstructured data in documents. Spark NLP includes state-of-the-art BERT-based models for token classification and sequence classification.

This session will cover the background and motivation behind these models, walk through practical code implementations, and introduce you to some of the pretrained models available in Spark NLP and its licensed counterpart, Spark NLP for Healthcare.

Luca Martial

Luca Martial is a Data Scientist at John Snow Labs. In this role, he has been building custom NLP solutions to showcase John Snow Labs’ healthcare library capabilities to customers, and training Spark NLP models for named entity recognition, relation extraction, text classification, de-identification and clinical entity resolution of medical notes and reports.

Recorded on:
June 22, 2022 @ 2:00 PM ET

Watch recording

Zero Shot Learning for Semantic Relation Extraction from Unstructured Text

Relation Extraction is one of the most important NLP tasks in healthcare, but training the models is expensive: it requires finding competent people and having them label the data. With the Zero-Shot Learning method, which has recently been adopted in the field of NLP, it has become possible to train Relation Extraction models without the need for data labeling. In this presentation, we will explain how to use the Zero-Shot Learning method for Relation Extraction in unstructured texts.

Muhammet Santas

Muhammet Santas has a Master’s Degree St. in Artificial Intelligence and works as a Data Scientist at John Snow Labs as part of the Healthcare NLP Team.

Recorded on:
May 25, 2022 @ 1:00 PM ET

Watch recording

Building Real-World Healthcare AI Projects from Concept to Production

In this Webinar, Juan Martinez from John Snow Labs and Ken Puffer from ePlus will share lessons learned from recent AI, ML, and NLP projects that have been successfully built & deployed in US hospital systems:

  • Improving patient flow forecasting at Kaiser Permanente
  • A real-time clinical decision support platform for Psychiatry and Oncology at Mount Sinai
  • Automated de-identification of 700 million patient notes at Providence Health

Then they will showcase a live demo of the recently launched AI Workflow Accelerator Bundle for Healthcare, which provides a complete data science platform supporting the full AI lifecycle:

  • Data analysis: Enable data analysts to query, visualize & build dashboards without coding
  • Data science: Enable data scientists to train models, share & scale experiments
  • Model deployment options
  • Operations: Enable DevOps & DataOps engineers to monitor, secure, and scale

The bundle is a turnkey solution composed of GPU-accelerated hardware from NVIDIA, proprietary software from John Snow Labs, and implementation services from ePlus. It is unique in providing all of the following healthcare-specific capabilities out of the box:

  • 2,300+ current, clean, and enriched healthcare datasets – from ontologies to benchmarks
  • Spark NLP for Healthcare – the most widely used NLP library in the healthcare industry – along with 250+ pre-trained clinical & biomedical NLP models for analyzing unstructured data
  • Spark OCR – including the ability to read, de-identify, and extract information from DICOM images
  • Security controls implemented within the platform, to enable a team of data scientists to effectively work & collaborate in air-gap, high-compliance environments

We will share speed & accuracy benchmarks measuring the optimization of John Snow Labs software and models on the GPU-accelerated Nvidia hardware – and how this translates to enabling your AI team to deliver bigger projects faster.

Juan Martinez

Juan Martinez is a Sr. Data Scientist who has worked at John Snow Labs since 2021. He graduated in Computer Engineering in 2006, and since then his main focus has been the application of Artificial Intelligence to texts and unstructured data. To better understand the intersection between Language and AI, he complemented his technical background with a Linguistics degree from Moscow Pushkin State Language Institute in 2012 and later from the University of Alcala (2014).

He is part of the Healthcare Data Science team at John Snow Labs. His main activities are training and evaluation of Deep Learning, Semantic and Symbolic models within the Healthcare domain, benchmarking, research and team coordination tasks. His other areas of interest are Machine Learning operations and Infrastructure.

Ken Puffer

Ken Puffer is the Chief Technology Officer for Healthcare solutions at ePlus. In this role, Ken consults with a broad range of healthcare leaders and technology partners to help ePlus develop, deploy, optimize, and maintain solutions that help solve the unique challenges facing healthcare.

Recorded on:
March 16, 2022 @ 2:00 PM ET

Watch recording

Deeper Clinical Document Understanding Using Relation Extraction

Recognizing entities is a fundamental step towards understanding a piece of text – but entities alone only tell half the story. The other half comes from explaining the relationships between entities. Spark NLP for Healthcare includes state-of-the-art (SOTA) deep learning models that address this issue by semantically relating entities in unstructured data.

John Snow Labs has developed multiple models utilizing BERT architectures with custom feature generation to achieve peer-reviewed SOTA accuracy on multiple benchmark datasets. This session will shed light on the background and motivation behind relation extraction, techniques, real-world use cases, and practical code implementation.

Hasham Ul Haq

Hasham Ul Haq is a Data Scientist at John Snow Labs, and an AI scholar and researcher at PI School of AI. During his career, he has worked on numerous projects across various sectors, including healthcare. At John Snow Labs, his primary focus is to build scalable and pragmatic NLP systems that are both production-ready and deliver SOTA performance. In particular, he has been working on Natural Language Inference, disambiguation, Named Entity Recognition, and a lot more! Hasham also has an active research profile with publications in NeurIPS and AAAI, and multiple scholarship grants and affiliations.

Prior to John Snow Labs, he was leading search engine and knowledge base development at one of Europe’s largest telecom providers. He has also been mentoring startups in computer vision by providing trainings and designing ML architectures.

Recorded on:
February 16, 2022 @ 2:00 PM ET

Watch recording

Rule-Based and Pattern Matching for Entity Recognition in Spark NLP

Finding patterns and matching strategies are well-known NLP procedures to extract information from text.

The Spark NLP library has two annotators that use these techniques to extract relevant information or recognize entities of interest in large-scale environments, for example when dealing with large numbers of documents from medical records, web pages, or data gathered from social media.

In this talk, we will see how to retrieve the information we are looking for by using the following annotators:

  • Entity Ruler, an annotator available in open-source Spark NLP.
  • Contextual Parser, an annotator available only in Spark NLP for Healthcare.
  • In addition, we will enumerate use cases where we can apply these annotators.

After this webinar, you will know when to use a rule approach to extract information from your data and the best way to set the available parameters in these annotators.
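To make the rule-based approach concrete, here is a minimal sketch of pattern matching for entity recognition in plain Python. It does not use the Entity Ruler or Contextual Parser annotators or their parameters; the labels, regular expressions, and sample note are hypothetical and only illustrate the general idea of mapping patterns to entity labels.

```python
import re

# Hypothetical rules for illustration: each entity label maps to one regex pattern.
RULES = {
    "DATE": r"\b\d{4}-\d{2}-\d{2}\b",
    "DOSAGE": r"\b\d+(?:\.\d+)?\s?(?:mg|ml|mcg)\b",
    "DRUG": r"\b(?:aspirin|metformin|ibuprofen)\b",
}

def match_entities(text, rules=RULES):
    """Return (label, matched text, start, end) tuples sorted by position in the text."""
    found = []
    for label, pattern in rules.items():
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((label, m.group(0), m.start(), m.end()))
    return sorted(found, key=lambda entity: entity[2])

note = "On 2021-03-04 the patient was started on Metformin 500 mg twice daily."
entities = match_entities(note)
for label, value, start, end in entities:
    print(f"{label:<7} {value!r} at [{start}, {end})")
```

Rules like these are transparent and need no training data, which is why they suit entities with a regular surface form such as dates, dosages, or identifiers.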

Danilo Burbano

Danilo Burbano is a Software and Machine Learning Engineer at John Snow Labs. He holds an MSc in Computer Science and has 12 years of commercial experience.

He has previously developed several software solutions over distributed system environments like microservices and big data pipelines across different industries and countries. Danilo has contributed to Spark NLP for the last 4 years. He is now working to maintain and evolve the Spark NLP library by continuously adding state-of-the-art NLP tools to allow the community to implement and deploy cutting-edge large-scale projects for AI and NLP.

Recorded on:
January 12, 2022 @ 2:00 PM ET

Watch recording

Automating Clinical Trial Master File Migration & Information Extraction

Pharmaceutical companies that conduct clinical trials, looking to get new treatments to market as quickly as possible, possess a high volume of documents. Millions of documents can be created as part of one trial and are stored in a document management system. When these documents need to be migrated to a new system – for example, when a pharma company acquires the rights to a drug or trial – they must often be read manually in order to classify them and extract metadata that is legally required and must be accurate. Traditionally, this migration is a long, complex, and labor-intensive process.

We present a solution based on a natural language processing (NLP) system that provides:

  • Speed – 80% reduction of manual labor and migration timeline, proven in major real-world projects
  • State of the art accuracy – based on Spark NLP for Healthcare, integrated in a human-in-the-loop solution
  • End-to-end, secure and compliant solution – Air-gap deployment, GxP and GAMP 5 validated

We will share lessons learned from an end-to-end migration of the trial master file at Novartis.

Jiri Dobes

Jiri Dobes is the Head of Solutions at John Snow Labs. He has been leading the development of machine learning solutions in healthcare and other domains for the past five years. Jiri is a PMP-certified project manager. His previous experience includes delivering large projects in the power generation sector and consulting for the Boston Consulting Group and large pharma. Jiri holds a Ph.D. in mathematical modeling.

Recorded on:
December 15, 2021 @ 2:00 PM ET

Watch recording

Enterprise-Scale Data Labeling & Automated Model Training with the Free Annotation Lab

Extracting data from unstructured documents is a common requirement – from finance and insurance to pharma and healthcare. Recent advances in deep learning offer impressive results on this task when models are trained on large enough datasets.

However, getting high-quality data involves a lot of manual effort. An annotation project is defined, annotation guidelines are specified, documents are imported, tasks are distributed among domain experts, a manager tracks the team’s performance, inter-annotator agreement is reached, and the resulting annotations are exported into a standard format. At enterprise-scale, complexity grows due to the volume of projects, tasks, and users.

John Snow Labs’ Annotation Lab is a free annotation tool that has already been deployed and used by large-scale enterprises for three years. This webinar presents how you can exploit the tool’s capabilities to easily manage any annotation project – from small team to enterprise-wide. It also shows how models can be trained automatically, without writing a single line of code, and how any pre-trained model can be used to pre-annotate documents to speed up projects by 5x – since domain experts don’t start annotating from scratch but correct and improve the models, as part of a no-code human-in-the-loop AI workflow.

Nabin Khadka

Nabin Khadka leads the team building the Annotation Lab at John Snow Labs. He has 7 years of experience as a software engineer, covering a broad range of technologies from web & mobile apps to distributed systems and large-scale machine learning.

Recorded on:
November 17, 2021 @ 2:00 PM ET

Watch recording

Creating a Clinical Knowledge Graph with Spark NLP and Neo4j

A knowledge graph represents a collection of connected entities and their relations. A knowledge graph that is fueled by machine learning uses natural language processing to construct a comprehensive, semantic view of the entities. A complete knowledge graph allows question answering and search systems to retrieve answers to given queries. In this study, we built a knowledge graph using Spark NLP models and Neo4j. The combination of Spark NLP and Neo4j is very promising for creating clinical knowledge graphs that support deeper analysis, Q&A tasks, and insight discovery.

Ali Emre Varol

Ali Emre Varol is a data scientist working on Spark NLP for Healthcare at John Snow Labs with a decade of industry experience. He has previously worked as a software engineer to develop ERP solutions and led teams and projects building machine learning solutions in a variety of industries. He is also pursuing his Ph.D. in Industrial Engineering at Middle East Technical University and holds an MS degree in Industrial Engineering.

Recorded on:
September 16, 2021 @ 2:00 PM ET

Watch recording

1 Line of Code to Use 200+ State-of-the-Art Clinical & Biomedical NLP Models

In this webinar, Christian Kasim Loan will teach you how to leverage hundreds of state-of-the-art medical models for various Medical and Healthcare domains in 1 line of code. These include Named Entity Recognition (NER) for Adverse Drug Events, Anatomy, Diseases, Chemicals, Clinical Events, Human Phenotypes, Posology, Radiology, Measurements, and many other fields, plus best-in-class resolution algorithms to map the extracted entities to medical code terminologies like ICD10, ICD-O, RXNORM, SNOMED, LOINC, and many more.

Additionally, we will showcase how to extract the relationships between predicted entities for the Posology, Drug Adverse Effects, Temporal Features, Body Part problems, and Procedures domains, and how to De-Identify your text documents.

Finally, we will take a look at the latest NLU Streamlit features and how you can leverage them to visualize all model predictions and test them out with 0 lines of code in your web browser!

Christian Kasim Loan

Data Scientist and Spark/Scala ML engineer

Recorded on:
August 11, 2021 @ 2:00 PM ET

Watch recording

Accurate Table Extraction from Documents & Images with Spark OCR

Extracting data formatted as a table (tabular data) is a common task — whether you’re analyzing financial statements, academic research papers, or clinical trial documentation. Table-based information varies heavily in appearance, fonts, borders, and layouts. This makes the data extraction task challenging even when the text is searchable – but more so when the table is only available as an image.

This webinar presents how Spark OCR automatically extracts tabular data from images. This end-to-end solution includes computer vision models for table detection and table structure recognition, as well as OCR models for extracting text & numbers from each cell. The implemented approach provides state-of-the-art accuracy for the ICDAR 2013 and TableBank benchmark datasets.

Mykola Melnyk

Mykola Melnyk is a senior Scala, Python, and Spark software engineer with 15 years of industry experience. He has led teams and projects building machine learning and big data solutions in a variety of industries – and is currently the lead developer of the Spark OCR library at John Snow Labs.

Recorded on:
Wednesday, June 16, 2021 @ 2:00 PM ET

Watch recording

Speed Optimization & Benchmarks in Spark NLP 3: Making the Most of Modern Hardware

Spark NLP is the most widely used NLP library in the enterprise, thanks to implementing production-grade, trainable, and scalable versions of state-of-the-art deep learning & transfer learning NLP research. It is also open source under a permissive Apache 2.0 license, officially supports Python, Java, and Scala, and is backed by a highly active community and JSL members.

The Spark NLP library implements core NLP algorithms including lemmatization, part of speech tagging, dependency parsing, named entity recognition, spell checking, multi-class and multi-label text classification, sentiment analysis, emotion detection, unsupervised keyword extraction, and state-of-the-art Transformers such as BERT, ELECTRA, ELMO, ALBERT, XLNet, and Universal Sentence Encoder.

The latest release of Spark NLP 3.0 comes with 1,100+ pretrained models, pipelines, and Transformers in 190+ different languages. It also delivers massive speed-ups on both CPU & GPU devices while extending support for the latest computing platforms such as new Databricks runtimes and EMR versions.

The talk will focus on how to scale Apache Spark / PySpark applications in YARN clusters, use GPUs in the new Databricks Apache Spark 3.x runtimes, and efficiently manage large-scale datasets in resource-demanding NLP applications. We will share benchmarks, tips & tricks, and lessons learned when scaling Spark NLP.

Maziyar Panahi

Maziyar Panahi is a Senior Data Scientist and Spark NLP Lead at John Snow Labs with over a decade of experience in public research. He is a senior Big Data engineer and a Cloud architect with extensive experience in computer networks and software engineering. He has been developing software and planning networks for the last 15 years. In the past, he also worked as a network engineer in high-level places after he completed his Microsoft and Cisco training (MCSE, MCSA, and CCNA).

He has been designing and implementing large-scale databases and real-time Web services in public and private Clouds such as AWS, Azure, and OpenStack for the past decade. He is one of the early adopters and main maintainers of the Spark NLP library. He is currently employed by The French National Centre for Scientific Research (CNRS) as a Big Data engineer and System/Network Administrator working at the Institute of Complex Systems of Paris (ISCPIF).

Recorded on:
Wednesday, May 12, 2021 @ 2:00 PM ET

Watch recording

Visual Document Understanding with Multi-Modal Image & Text Mining in Spark OCR 3

The Transformer architecture in NLP has truly changed the way we analyze text. NLP models are great at processing digital text, but many real-world applications use documents with more complex formats. For example, healthcare systems often include visual lab results, sequencing reports, clinical trial forms, and other scanned documents. When we only use an NLP approach for document understanding, we lose layout and style information – which can be vital for document image understanding. New advances in multi-modal learning allow models to learn from both the text in documents (via NLP) and visual layout (via computer vision).

We provide multi-modal visual document understanding, built on Spark OCR based on the LayoutLM architecture. It achieves new state-of-the-art accuracy in several downstream tasks, including form understanding (from 70.7 to 79.3), receipt understanding (from 94.0 to 95.2) and document image classification (from 93.1 to 94.4).

Mykola Melnyk

Mykola Melnyk is a senior Scala, Python, and Spark software engineer with 15 years of industry experience. He has led teams and projects building machine learning and big data solutions in a variety of industries – and is currently the lead developer of the Spark OCR library at John Snow Labs.

Recorded on:
March 10, 2021 @ 2:00 PM EST

Watch recording

Using & Expanding the NLP Models Hub

The NLP Models Hub which powers the Spark NLP and NLU libraries takes a different approach than the hubs of other libraries like TensorFlow, PyTorch, and Hugging Face. While it also provides an easy-to-use interface to find, understand, and reuse pre-trained models, it focuses on providing production-grade state-of-the-art models for each NLP task instead of a comprehensive archive.

This implies a higher quality bar for accepting community contributions to the NLP Models Hub – in terms of automated testing, level of documentation, and transparency of accuracy metrics and training datasets. This webinar shows how you can make the most of it, whether you’re looking to easily reuse models or contribute new ones.

Dia Trambitas

Dia Trambitas is a computer scientist with a rich background in Natural Language Processing. She has a Ph.D. in Semantic Web from the University of Grenoble, France, where she worked on ways of describing spatial and temporal data using OWL ontologies and reasoning based on semantic annotations. She then shifted her focus to text processing and data extraction from unstructured documents, a subject she has been working on for the last 10 years. She has rich experience working with different annotation tools and leading document classification and NER extraction projects in verticals such as Finance, Investment, Banking, and Healthcare.

Recorded on:
February 18, 2021 @ 2:00 PM EST

Watch recording

State-of-the-art Natural Language Processing for 200+ Languages with 1 Line of code

Learn to harness the power of 1,000+ production-grade & scalable NLP models for 200+ languages – all available with just 1 line of Python code by leveraging the open-source NLU library, which is powered by the widely popular Spark NLP.

John Snow Labs has delivered over 80 releases of Spark NLP to date, making it the most widely used NLP library in the enterprise and providing the AI community with state-of-the-art accuracy and scale for a variety of common NLP tasks. The most recent releases include pre-trained models for over 200 languages – including languages that do not use spaces for word segmentation, like Chinese, Japanese, and Korean, and languages written from right to left, like Arabic, Farsi, Urdu, and Hebrew. All software and models are free and open source under an Apache 2.0 license.

This webinar will show you how to leverage the multi-lingual capabilities of Spark NLP & NLU – including automated language detection for up to 375 languages, and the ability to perform translation, named entity recognition, stopword removal, lemmatization, and more in a variety of language families. We will create Python code in real-time and solve these problems in just 30 minutes. The notebooks will then be made freely available online.

Christian Kasim Loan

Data Scientist and Spark/Scala ML engineer

Recorded on:
January 13, 2021 @ 2:00 PM EST

Watch recording

Automated Drug Adverse Event Detection from Unstructured Text

Adverse Drug Events (ADEs) are potentially very dangerous to patients and are amongst the top causes of morbidity and mortality. Monitoring & reporting of ADEs is required by pharma companies and healthcare providers. This session introduces new state-of-the-art deep learning models for automatically detecting if a free-text paragraph includes an ADE (document classification), as well as extracting the key terms of the event in structured form (named entity recognition). Using live Python notebooks and real examples from clinical and conversational text, we’ll show how to apply these models using the Spark NLP for Healthcare library.

Julio Bonis

Julio Bonis is a data scientist working on Spark NLP for Healthcare at John Snow Labs. Julio has broad experience in software development and design of complex data products within the scope of Real World Evidence (RWE) and Natural Language Processing (NLP). He also has substantial clinical and management experience – including entrepreneurship and Medical Affairs. Julio is a medical doctor specialized in Family Medicine (registered GP), has an Executive MBA – IESE, an MSc in Bioinformatics, and an MSc in Epidemiology.

Recorded on:
November 12, 2020 @ 2:00 PM EST

Watch recording

John Snow Labs NLU: Become a Data Science Superhero with One Line of Python code

Learn how to unleash the power of 350+ pre-trained NLP models, 100+ Word Embeddings, 50+ Sentence Embeddings, and 50+ Classifiers in 46 languages with 1 line of Python code. John Snow Labs’ new NLU library marries the power of Spark NLP with the simplicity of Python. Tackle NLP tasks like NER, POS, Emotion Analysis, Keyword Extraction, Question Answering, Sarcasm Detection, and Document Classification using state-of-the-art techniques. The end-to-end library includes word & sentence embeddings like BERT, ELMO, ALBERT, XLNET, ELECTRA, USE, Small-BERT, and others; text wrangling and cleaning like tokenization, chunking, lemmatizing, stemming, normalizing, spell-checking, and matchers; and easy visualization capabilities using your embedded data with T-SNE.

Christian Kasim Loan, the creator of NLU, will walk through NLU and show you how easy it is to generate T-SNE visualizations of 6 Deep Learning Embeddings, achieve top classification results on text problems from Kaggle competitions with 1 line of NLU code, and leverage the latest & greatest advances in deep learning & transfer learning.

Christian Kasim Loan

Data Scientist and Spark/Scala ML engineer

Recorded on:
September 16, 2020 @ 2:00 PM EST

Watch recording

Answering natural language questions

The ability to directly answer medical questions asked in natural language either about a single entity (“what drugs has this patient been prescribed?”) or a set of entities (“list stage 4 lung cancer patients with no history of smoking”) has been a longstanding industry goal, given its broad applicability across many use cases.

This webinar presents a software solution, based on state-of-the-art deep learning and transfer learning research, for translating natural language questions to SQL statements. The case study is a system that answers clinical questions by training domain-specific models and learning from reference data. This is a production-grade, trainable, and scalable capability of Spark NLP Enterprise. Live notebooks will be shared to explain how you can use it in your own projects.

Prabod Rathnayaka

Graduate Research Assistant and PhD Student at La Trobe University

Recorded on:
August 19, 2020 @ 2:00 PM EST

Watch recording

Accurate de-identification, obfuscation, and editing of scanned medical documents and images

One kind of noisy data that healthcare data scientists deal with is scanned documents and images: from PDF attachments of lab results, referrals, or genetic testing to DICOM files with medical imaging. These files are challenging to de-identify, because personal health information (PHI) can appear anywhere in free text – so it cannot be removed with rules or regular expressions – or be “burned” into images so that it’s not even available as digital text to begin with.

This webinar presents a software system that tackles these challenges, with lessons learned from applying it in real-world production systems. The workflow uses:

  • Spark OCR to extract both digital and scanned text from PDF and DICOM files
  • Spark NLP for Healthcare to recognize sensitive data in the extracted free text
  • The de-identification module to delete, replace, or obfuscate PHI
  • Spark OCR to generate new PDF or DICOM file with the de-identified data
  • A local secure environment in which the whole workflow runs, with no need to share data with any third party or a public cloud API

Dr. Alina Petukhova

Data Scientist at John Snow Labs

Recorded on:
July 22, 2020 @ 2:00 PM EST

Watch recording

Hardening a Cleanroom AI Platform to allow model training & inference on Protected Health Information

Artificial intelligence projects in high-compliance industries, like healthcare and life science, often require processing Protected Health Information (PHI). This may happen because the nature of the projects does not allow full de-identification in advance – for example, when dealing with rare diseases, genetic sequencing data, identity theft, or training de-identification models – or when training is done on anonymized data but inference must happen on data with PHI.

In such scenarios, the alternative is to create an “AI cleanroom” – an isolated, hardened, air-gap environment where the work happens. Such a software platform should enable data scientists to log into the cleanroom, and do all the development work inside it – from initial data exploration & experimentation to model deployment & operations – while no data, computation, or generated assets ever leave the cleanroom.

First, this webinar presents the architecture of such a Cleanroom AI Platform, which has been actively used by Fortune 500 companies for the past three years. Second, it will survey the hundreds of DevOps & SecOps features required to realize such a platform – from multi-factor authentication and point-to-point encryption to vulnerability scanning and network isolation. Third, it will explain how a Kubernetes-based architecture enables “Cleanroom AI” without giving up on the main benefits of cloud computing: elasticity, scalability, turnkey deployment, and a fully managed environment.

Ali Naqvi

Ali Naqvi is the lead product manager of the AI Platform at John Snow Labs. Ali has extensive experience building end-to-end data science platforms & solutions for the healthcare and life science industries, using modern technology stacks such as Kubernetes, TensorFlow, Spark, MLflow, Elastic, NiFi, and related tools. Ali has a Master’s degree in Molecular Science and over a decade of hands-on experience in software engineering and academic research.

Recorded on:
June 24, 2020 @ 2:00 PM EST

Watch recording

Maximizing Text Recognition Accuracy with Image Transformers in Spark OCR

Spark OCR is an optical character recognition library that scales natively on any Spark cluster, enables processing documents privately without uploading them to a cloud service, and, most importantly, provides state-of-the-art accuracy for a variety of common use cases. A primary method of maximizing accuracy is using a set of pre-built image pre-processing transformers – for noise reduction, skew correction, object removal, automated scaling, erosion, binarization, and dilation. These transformers can be combined into OCR pipelines that effectively resolve common ‘document noise’ issues that reduce OCR accuracy.

This webinar describes real-world OCR use cases, common accuracy issues they bring, and how to use image transformers in Spark OCR in order to resolve them at scale. Example Python code will be shared using executable notebooks that will be made publicly available.

Mykola Melnyk

Mykola Melnyk is a senior Scala, Python, and Spark software engineer with 15 years of industry experience. He has led teams and projects building machine learning and big data solutions in a variety of industries – and is currently the lead developer of the Spark OCR library at John Snow Labs.

Recorded on:
May 27 2020 at 2pm EST

Watch recording

Best Practices & Tools for Accurate Document Annotation and Data Abstraction

Are you working on machine learning tasks such as sentiment analysis, named entity recognition, text classification, image classification or audio segmentation? If so, you need training data adapted for your particular domain and task.

This webinar will explain the best practices and strategies for getting the training data you need. We will go over the setup of the annotation team, the workflows that need to be in place for guaranteeing high accuracy and labeler agreement, and the tools that will help you increase productivity and eliminate errors.
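One common way to quantify labeler agreement is Cohen's kappa, which corrects the raw agreement rate between two annotators for the agreement expected by chance. The sketch below is an illustrative assumption about how such a check could look; the webinar abstract does not say which agreement metric is used, and the function name `cohen_kappa` and the sample labels are hypothetical.

```python
# Cohen's kappa for two annotators labeling the same items.
# Illustrative only; the metric choice is an assumption, not taken from the webinar.
from collections import Counter
from typing import Sequence


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Return (observed - expected) / (1 - expected) agreement."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Fraction of items on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_1 = ["POS", "POS", "NEG", "NEG"]
annotator_2 = ["POS", "NEG", "NEG", "NEG"]
kappa = cohen_kappa(annotator_1, annotator_2)  # observed 0.75, expected 0.5
```

Here the annotators agree on three of four items, but half of that agreement is expected by chance, so kappa comes out at 0.5 rather than 0.75.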

Dia Trambitas

Dia Trambitas is a computer scientist with a rich background in Natural Language Processing. She has a Ph.D. in Semantic Web from the University of Grenoble, France, where she worked on ways of describing spatial and temporal data using OWL ontologies and reasoning based on semantic annotations. She then shifted her focus to text processing and data extraction from unstructured documents, a subject she has been working on for the last 10 years. She has extensive experience working with different annotation tools and leading document classification and NER projects in verticals such as Finance, Investment, Banking, and Healthcare.

Recorded on:
April 29 2020 at 2pm EST

Watch recording

Automated Mapping of Clinical Entities from Natural Language Text to Medical Terminologies

The immense variety of terms, jargon, and acronyms used in medical documents means that named entity recognition of diseases, drugs, procedures, and other clinical entities isn’t enough for most real-world healthcare AI applications. For example, knowing that “renal insufficiency”, “decreased renal function” and “renal failure” should be mapped to the same code, before using that code as a feature in a patient risk prediction or clinical guidelines recommendation model, is critical to that model’s accuracy. Without it, the training algorithm will see these three terms as three separate features and will severely under-estimate the relevance of this condition.
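The effect on features can be shown with a small Python sketch. It uses a hand-written lookup table instead of a trained resolver, and the concept identifier `CONCEPT_RENAL_IMPAIRMENT` is a made-up placeholder, not a real SNOMED-CT or ICD-10-CM code; the names `TERM_TO_CONCEPT` and `normalize` are likewise assumptions for illustration.

```python
# Toy entity normalization: map surface forms to one shared concept ID.
# The lookup table and the concept ID are hypothetical placeholders.
from collections import Counter

TERM_TO_CONCEPT = {
    "renal insufficiency": "CONCEPT_RENAL_IMPAIRMENT",
    "decreased renal function": "CONCEPT_RENAL_IMPAIRMENT",
    "renal failure": "CONCEPT_RENAL_IMPAIRMENT",
}


def normalize(mention: str) -> str:
    """Return the concept ID for a mention, or the cleaned mention if unknown."""
    key = mention.strip().lower()
    return TERM_TO_CONCEPT.get(key, key)


# Three mentions of the same condition extracted from different notes.
mentions = ["Renal insufficiency", "decreased renal function", "renal failure"]

# Without normalization: three separate features, each seen once.
raw_features = Counter(m.strip().lower() for m in mentions)

# With normalization: one feature, seen three times.
resolved_features = Counter(normalize(m) for m in mentions)
```

A downstream model trained on `resolved_features` sees one condition with a count of three, instead of three unrelated terms with a count of one each.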

This need for entity resolution, also known as entity normalization, is therefore a key requirement of a healthcare NLP library. This webinar explains how Spark NLP for Healthcare addresses this issue by providing trainable, deep-learning-based, clinical entity resolution, as well as pre-trained models for the most commonly used medical terminologies: SNOMED-CT, RxNorm, ICD-10-CM, ICD-10-PCS, and CPT.

Andrés Fernández

Andrés Fernández is a Machine Learning Engineer and Data Scientist at John Snow Labs with 10 years of experience in the Finance, Retail and Healthcare industries.

After completing his MSc in Software Engineering at the University of Málaga, he has been helping Latin American and USA companies conceptualize, design and build AI solutions to automate their operations in functions like Insurance Claims, Pricing, Retail Procurement, Marketing, and others. Andrés has dedicated the last 5 years to real-world applications of Natural Language Processing, focusing mainly on Log Processing, Text Clustering, and Entity Resolution.

Recorded on:
April 8 2020 at 2pm EST

Watch recording

AI Model Governance in a High-Compliance Industry

Model governance defines a collection of best practices for data science – versioning, reproducibility, experiment tracking, automated CI/CD, and others. Within a high-compliance setting where the data used for training or inference contains protected health information (PHI) or similarly sensitive data, additional requirements such as strong identity management, role-based access control, approval workflows, and a full audit trail are added.

This webinar summarizes requirements and best practices for establishing a high-productivity data science team within a high-compliance environment. It then demonstrates how these requirements can be met using John Snow Labs’ Healthcare AI Platform.

Ali Naqvi

Ali Naqvi is the lead product manager of the AI Platform at John Snow Labs. Ali has extensive experience building end-to-end data science platforms & solutions for the healthcare and life science industries, using modern technology stacks such as Kubernetes, TensorFlow, Spark, MLflow, Elastic, NiFi, and related tools. Ali has a Master’s degree in Molecular Science and over a decade of hands-on experience in software engineering and academic research.

Recorded on:
March 18 2020 at 2pm EST

Watch recording

Accurate De-Identification of Structured & Unstructured Medical Data at Scale

Recent advances in deep learning enable automated de-identification of medical data to approach the accuracy achievable via manual effort. This includes accurate detection & obfuscation of patient names, doctor names, locations, organizations, and dates from unstructured documents – or accurate detection of column names & values in structured tables. This webinar explains:

  • What’s required to de-identify medical records under the US HIPAA privacy rule
  • Typical de-identification use cases, for structured and unstructured data
  • How to implement de-identification of these use cases using Spark NLP for Healthcare

After the webinar, you will understand how to de-identify data automatically, accurately, and at scale, for the most common scenarios.
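The obfuscation step for unstructured text can be sketched in a few lines of Python. This sketch assumes the sensitive spans have already been detected by an upstream model; here they are located by hand with a hypothetical `find_span` helper. It is not the Spark NLP for Healthcare API, and it makes no claim about what HIPAA requires.

```python
# Toy obfuscation: replace already-detected sensitive spans with placeholders.
# Detection is assumed to happen upstream; helper names are hypothetical.
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), with end exclusive


def find_span(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of a phrase; stands in for a detector's output."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def obfuscate(text: str, spans: List[Span]) -> str:
    """Replace each span with <LABEL>, assuming the spans do not overlap."""
    result = text
    # Work from the end of the text so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        result = result[:start] + f"<{label}>" + result[end:]
    return result


note = "John Smith was seen by Dr. Lee on 2020-02-26."
spans = [
    find_span(note, "John Smith", "PATIENT"),
    find_span(note, "Lee", "DOCTOR"),
    find_span(note, "2020-02-26", "DATE"),
]
masked = obfuscate(note, spans)
```

Replacing from right to left is the key design choice: each substitution changes the length of the text, so processing the last span first keeps the remaining offsets correct.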

Julio Bonis

Julio Bonis is a data scientist working on Spark NLP for Healthcare at John Snow Labs. Julio has broad experience in software development and design of complex data products within the scope of Real World Evidence (RWE) and Natural Language Processing (NLP). He also has substantial clinical and management experience – including entrepreneurship and Medical Affairs. Julio is a medical doctor specialized in Family Medicine (registered GP), has an Executive MBA – IESE, an MSc in Bioinformatics, and an MSc in Epidemiology.

Recorded on:
February 26 2020 at 2pm EST

Watch recording

State-of-the-art named entity recognition with BERT

Deep neural network models have recently achieved state-of-the-art performance gains in a variety of natural language processing (NLP) tasks. However, these gains rely on the availability of large amounts of annotated examples, without which state-of-the-art performance is rarely achievable. This is especially inconvenient for the many NLP fields where annotated examples are scarce, such as medical text.

Named entity recognition (NER) is one of the most important tasks for the development of more sophisticated NLP systems. In this webinar, we will walk you through how to train a custom NER model using BERT embeddings in Spark NLP – taking advantage of transfer learning to greatly reduce the amount of annotated text needed to achieve accurate results. After the webinar, you will be able to train your own NER models with your own data in Spark NLP.

Veysel Kocaman

Veysel Kocaman is a Senior Data Scientist and ML Engineer at John Snow Labs and has a decade of industry experience. He is also pursuing his PhD in CS and lecturing at Leiden University (NL), and holds an MS degree in Operations Research from Penn State University. He is affiliated with Google as a Developer Expert in Machine Learning.