Turnkey Deployment of Medical Language Models as Private API Endpoints
Join us for an insightful webinar showcasing how John Snow Labs’ signature models can now be effortlessly deployed as private API endpoints and seamlessly integrated into your healthcare text-processing workflows.
We will walk you through comprehensive, end-to-end examples demonstrating how to discover, deploy, and utilize John Snow Labs’ state-of-the-art language models, fine-tuned for the healthcare domain, across three leading marketplaces: AWS SageMaker, Snowflake Marketplace, and Databricks Marketplace.
Key highlights include:
- Effortless Integration: Learn how easy it is to incorporate John Snow Labs language models into your existing workflows, enhancing efficiency and accuracy in medical text processing.
- Flexible Endpoints: Discover the convenience of turning API endpoints on and off based on your processing needs, optimizing cost-effectiveness.
- Scalable Infrastructure: Explore multiple infrastructure options designed to meet varying target scales and processing requirements.
- Performance Benchmarks: Utilize accuracy and throughput benchmarks to make informed decisions about the most suitable infrastructure for your needs.
Don’t miss this opportunity to streamline your deployment process, reduce integration hassle, and elevate the performance of your healthcare applications.
Kshitiz Shakya
Kshitiz has over 10 years of experience in software engineering. He currently works as a Software Engineer at John Snow Labs, building solutions and products that help businesses meet their needs. He is deeply passionate about creating scalable, robust, and trusted products that make people’s lives easier.
Automated Testing of Bias, Fairness, and Robustness of Language Models in the Generative AI Lab
Testing and mitigating bias, fairness, and robustness issues in AI applications is now a legal requirement in the USA in regulated industries like healthcare, human resources, and financial services. This webinar presents new capabilities within the no-code Generative AI Lab, designed for building custom language models by non-technical domain experts, that enable compliance with such requirements and embody best practices for Responsible AI.
We’ll cover how you can:
- Create, edit, and reuse test suites
- Automatically generate test cases for robustness & bias
- Manually review and edit tests when needed
- Run LLM test suites and see both summarized and drill-down results
- Run regression testing before certifying new versions of models, or competing models
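As a rough illustration of what automatically generated robustness tests involve, the sketch below perturbs each input and checks whether a model’s prediction stays the same. It does not use the Generative AI Lab or any John Snow Labs API: the function names, the two perturbations, and `toy_model` are hypothetical stand-ins chosen for this example.

```python
import random


def perturb_uppercase(text: str) -> str:
    """Robustness perturbation: convert the whole input to upper case."""
    return text.upper()


def perturb_typo(text: str, seed: int = 0) -> str:
    """Robustness perturbation: swap two adjacent characters at a seeded position."""
    if len(text) < 2:
        return text
    rng = random.Random(seed)
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def generate_test_cases(samples, perturbations):
    """Pair every original sample with each perturbed variant."""
    return [
        {"test": name, "original": s, "perturbed": fn(s)}
        for s in samples
        for name, fn in perturbations.items()
    ]


def run_suite(model, cases):
    """A case passes when the model gives the same label for both variants."""
    return [
        {**case, "passed": model(case["original"]) == model(case["perturbed"])}
        for case in cases
    ]


def summarize(results):
    """Pass rate per test type (the summarized view of the per-case results)."""
    counts = {}
    for r in results:
        passed, total = counts.get(r["test"], (0, 0))
        counts[r["test"]] = (passed + int(r["passed"]), total + 1)
    return {name: passed / total for name, (passed, total) in counts.items()}


def toy_model(text: str) -> str:
    """Hypothetical stand-in classifier: a case-sensitive keyword match."""
    return "positive" if "pain" in text else "negative"


PERTURBATIONS = {"uppercase": perturb_uppercase, "typo": perturb_typo}
samples = ["patient reports chest pain", "no complaints"]
cases = generate_test_cases(samples, PERTURBATIONS)
results = run_suite(toy_model, cases)
print(summarize(results))
```

The per-case `results` list plays the role of the drill-down view, while `summarize` gives the aggregate. Running the same suite against two model versions and comparing pass rates is one simple way to approach regression testing.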
This webinar is intended for anyone interested in testing, certifying, and mitigating bias issues in custom language models for real-world systems.
David Cecchini
Ph.D. at Tsinghua-Berkeley Shenzhen Institute | Data Scientist
Fast, Cheap, Scalable: Open-Source LLM Inference with Spark NLP
Learn how the open-source Spark NLP library provides optimized and scalable LLM inference for high-volume text and image processing pipelines. This session dives into optimized LLM inference without the overhead of commercial APIs or extensive hardware setups. We will show live code examples and benchmarks comparing Spark NLP’s performance and cost-effectiveness against both commercial APIs and other open-source solutions.
Key Takeaways:
- Learn how to efficiently process millions of LLM interactions daily, circumventing the costs associated with traditional LLM deployments.
- Discover advanced methods for embedding LLM inference within existing data processing pipelines, enhancing throughput and reducing latency.
- Review benchmarks that compare Spark NLP’s speed and cost metrics relative to commercial and open-source alternatives.
Danilo Burbano
Danilo Burbano is a Software and Machine Learning Engineer at John Snow Labs. He holds an MSc in Computer Science and has 13 years of commercial experience.
He has previously developed several software solutions over distributed system environments like microservices and big data pipelines across different industries and countries. Danilo has contributed to Spark NLP for the last 6 years. He is now working to maintain and evolve the Spark NLP library by continuously adding state-of-the-art NLP tools to allow the community to implement and deploy cutting-edge large-scale projects for AI and NLP.
New State-of-the-art Accuracy for the 3 Primary Uses of Healthcare Language Models
This talk presents new levels of accuracy that have very recently been achieved, on public and independently reproducible benchmarks, on the three most common use cases for language models in healthcare:
- Understanding clinical documents: Such as information extraction from clinical notes and reports; detecting entities, relationships, and medical codes; de-identification; and summarization.
- Reasoning about patients: Fusing information across multiple modalities (tabular data, free text, imaging, omics) to create a longitudinal view of each patient, including making reasonable inferences and explaining them.
- Answering medical questions: Answering medical licensing exam questions, biomedical research questions, and similar medical knowledge questions – accurately, without hallucinations, and while citing relevant sources.
Join to learn what has recently become possible in the fast-changing world of Healthcare AI.
David Talby
David Talby is the Chief Technology Officer at John Snow Labs, helping companies apply artificial intelligence to solve real-world problems in healthcare and life science. David has extensive experience building and running web-scale software platforms and teams – in startups, for Microsoft’s Bing in the US and Europe, and to scale Amazon’s financial systems in Seattle and the UK. David holds a Ph.D. in Computer Science and Master’s degrees in both Computer Science and Business Administration. He was named USA CTO of the Year by the Global 100 Awards in 2022 and Game Changers Awards in 2023.
The 2024 Generative AI in Healthcare Survey
This webinar presents key findings from the 2024 Generative AI in Healthcare Survey, conducted in February & March of 2024 by Gradient Flow to assess the key use cases, priorities, and concerns of professionals and technology leaders in Generative AI in healthcare. Topics covered:
- Current levels of adoption and budget allocation
- Types of language models being used
- Use cases for LLMs
- Priorities for evaluating LLMs and roadblocks
- LLM model enhancement strategies
- LLM testing for Responsible AI requirements
Ben Lorica
Ben Lorica is the founder of Gradient Flow. He is a highly respected data scientist, having served in leading roles at O’Reilly Media (Chief Data Scientist, Program Chair of the Strata Data Conference, O’Reilly Artificial Intelligence Conference, and TensorFlow World), at Databricks, and as an advisor to startups. He serves as co-chair for several leading industry conferences: the AI Conference, the NLP Summit, the Data+AI Summit, Ray Summit, and K1st World. He is the host of the Data Exchange podcast and edits the Gradient Flow newsletter.
David Talby
David Talby is the Chief Technology Officer at John Snow Labs, helping companies apply artificial intelligence to solve real-world problems in healthcare and life science. David has extensive experience building and running web-scale software platforms and teams – in startups, for Microsoft’s Bing in the US and Europe, and to scale Amazon’s financial systems in Seattle and the UK. David holds a Ph.D. in Computer Science and Master’s degrees in both Computer Science and Business Administration. He was named USA CTO of the Year by the Global 100 Awards in 2022 and Game Changers Awards in 2023.
John Snow Labs’ Native Integrations with LangChain and HayStack
Learn to enhance Retrieval Augmented Generation (RAG) pipelines in this webinar on John Snow Labs’ integrations with LangChain and HayStack. This session highlights the ability to retain your existing pipeline structure while upgrading its accuracy and scalability. Accuracy is improved thanks to customizable embedding collection and document splitting. Using Spark NLP’s optimized pipelines greatly improves scalability, runtime speed, and as a result cost.
Learn how these native integrations enable an easy transition to more effective methods, enhancing document ingestion from diverse sources without overhauling existing systems. Whether your goal is to enhance data privacy, optimize NLP & LLM accuracy, or scale your RAG applications to millions of documents, this webinar will equip you with the knowledge and tools to fully leverage John Snow Labs’ software to get it done. Join us to unlock the potential of your applications with the latest innovations in Generative AI, without departing from the familiar toolset of your current pipeline.
Muhammet Santas
Muhammet Santas holds a Master’s Degree in Artificial Intelligence and currently serves as a Senior Data Scientist at John Snow Labs, where he is an integral part of the Healthcare NLP Team. With a robust background in AI, Muhammet contributes his expertise to advancing NLP technologies within the healthcare sector.
Next-Gen Table Extraction from Visual Documents: Leveraging Multimodal AI
Join us in exploring the latest advancements in multimodal AI for extracting tabular data from visual documents. This session will delve into novel methods implemented in John Snow Labs’ Visual NLP library, which has significantly improved the accuracy of information extraction and question answering from tables in PDFs and image files.
The webinar will cover a range of practical applications, demonstrating how this technology is adept at handling complex documents such as financial disclosures, clinical trial results, insurance rates, lab scores, and academic research. The focus will be zero-shot models, where the AI model directly interprets and responds to queries from source images, eliminating the need for specialized training or tuning.
We’ll also cover Visual NLP capabilities that have been specifically designed to enhance table extraction quality, especially in challenging cases like multi-line cells or borderless tables. We’ll discuss the technical underpinnings of this feature, including the integration of computer vision and optical character recognition for detecting tables and individual cells within them. We’ll touch upon how that extends to support for tables with custom borders, dark & noisy backgrounds, uncommon table layouts, multilingual text, and international number & currency formats.
This webinar is ideal for professionals and researchers who face the challenge of converting complex visual data into actionable insights. Attendees will leave with a deeper understanding of how these cutting-edge AI models can be applied in various fields to improve data accessibility and analysis efficiency.
Alberto Andreotti
Alberto Andreotti is a data scientist at John Snow Labs, specializing in Machine Learning, Natural Language Processing, and Distributed Computing. With a background in Computer Engineering, he has expertise in developing software for both Embedded Systems and Distributed Applications.
Alberto is skilled in Java and C++ programming, particularly for mobile platforms. His focus includes Machine Learning, High-Performance Computing (HPC), and Distributed Systems, making him a pivotal member of the John Snow Labs team.
Building a RAG LLM Clinical Chatbot with John Snow Labs in Databricks
In the era of rapidly evolving Large Language Models (LLMs) and chatbot systems, we highlight the advantages of using LLM systems based on RAG (Retrieval Augmented Generation). These systems excel when accurate answers are preferred over creative ones, such as when answering questions about medical patients or clinical guidelines. RAG LLMs reduce hallucinations by citing the source of each fact, and they enable the use of private documents to answer questions. They also enable near-real-time data updates without re-tuning the LLM.
This session walks through the construction of a RAG (Retrieval Augmented Generation) Large Language Model (LLM) clinical chatbot system, leveraging John Snow Labs’ healthcare-specific LLM and NLP models within the Databricks platform.
The system leverages LLMs to query the knowledge base via a vector database that is populated by Healthcare NLP at scale within a Databricks notebook. Coupled with a user-friendly graphical interface, this setup allows users to engage in productive conversations with the system, enhancing the efficiency and effectiveness of healthcare workflows. Acknowledging the need for data privacy, security, and compliance, this system runs fully within customers’ cloud infrastructure – with zero data sharing and no calls to external APIs.
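To make the retrieval step concrete, here is a minimal sketch of the general RAG pattern: index documents as vectors, retrieve the closest ones for a question, and build a prompt that asks for cited answers. It is not the system presented in the webinar. The bag-of-words `embed` function stands in for a real embedding model, `VectorStore` is a hypothetical in-memory substitute for a vector database, and the documents are invented placeholder text.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorStore:
    """Hypothetical in-memory substitute for a vector database."""

    def __init__(self):
        self.entries = []

    def add(self, doc_id: str, text: str) -> None:
        self.entries.append((doc_id, text, embed(text)))

    def search(self, query: str, k: int = 2):
        """Return the k documents most similar to the query."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[2]), reverse=True)
        return [(doc_id, text) for doc_id, text, _ in ranked[:k]]


def build_prompt(question: str, passages) -> str:
    """Assemble a prompt that asks the LLM to answer only from cited passages."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in passages)
    return (
        "Answer using only the passages below and cite their ids.\n"
        f"{context}\n"
        f"Question: {question}"
    )


store = VectorStore()
# Invented placeholder documents, not real clinical guidance.
store.add("guideline-bp", "Placeholder text: blood pressure follow-up visit schedule for adults.")
store.add("guideline-dm", "Placeholder text: diabetes screening schedule and glucose testing.")
store.add("note-17", "Placeholder text: patient reports knee pain after exercise.")

question = "What is the follow-up schedule for blood pressure?"
hits = store.search(question, k=2)
prompt = build_prompt(question, hits)
print(prompt)  # this string would then be sent to the LLM
```

Because the knowledge lives in the store rather than in the model weights, adding or replacing a document changes the answers immediately, which is the near-real-time update property described above.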
Amir Kermany
Amir is the Technical Industry Lead for Healthcare & Life Sciences at Databricks, where he focuses on developing advanced analytics solution accelerators to help health care and life sciences organizations in their data and AI journey.
Veysel Kocaman
Veysel is Head of Data Science at John Snow Labs, improving the Spark NLP for the Healthcare library and delivering hands-on projects in Healthcare and Life Science.
Introducing the Medical Research Chatbot
Whether you’re a healthcare provider seeking literature-backed medical information, a biomedical researcher aiming for efficient knowledge discovery, or a data administrator looking to better query your proprietary documents at scale, the Medical Chatbot can get you trustworthy answers faster. In an era where data-driven decisions are at the core of medical work, while it’s impossible to manually keep up to date with the amount of new medical knowledge published every day, the need for an intelligent medical assistant has never been greater. This webinar introduces the Medical Chatbot, a software platform designed to bridge this gap.
Covered Features:
- Answer Medical Research Questions: Obtain accurate, literature-backed answers to your medical queries.
- Have a Conversation: Chat in natural language, remember context, and save responses for future reference.
- Cite Sources & Explain Responses: Each answer comes with cited references for further exploration.
- Build Custom Knowledge Bases: Create and query your own secure databases with proprietary documents.
- Keep Private Content Private: Control who has access to each knowledge base, and don’t share it outside your organization.
- Manage Role-Based Access: Easily manage users and groups, including single sign-on.
Register today to discover how the Medical Chatbot can transform your healthcare research.
Dia Trambitas
Dia Trambitas is a computer scientist with a specialized focus on Natural Language Processing (NLP). Serving as the Head of Product at John Snow Labs, Dia oversees the evolution of the NLP Lab, the best-in-class tool for text and image annotation in the healthcare domain.
Dia holds a Ph.D. in Computer Science focused on Semantic Web and ontology-based reasoning. She has a vivid interest in text processing and data extraction from unstructured documents, a subject she has been working on for the last decade.
Professionally, Dia has been involved in various information extraction and data science projects across sectors such as Finance, Investment Banking, Life Science, and Healthcare. Her comprehensive experience and knowledge in the field position her as a competent figure in the realms of NLP and Data Science.
Extracting Social Determinants of Health from Free-Text Medical Records
Social Determinants of Health (SDOH) are defined by the Centers for Disease Control and Prevention (CDC) as: “the conditions in the places where people live, learn, work, and play that affect a wide range of health and quality-of-life risks and outcomes”.
By creating a predicted profile of SDOH for a patient, the researcher/health professional can bolster information from existing SDOH assessments, find some guidance as to what aspects of SDOH screening to pay more attention to, and conduct interactions to screen for specific SDOH that may not be self-evident. This both helps less experienced personnel detect SDOH and optimizes the time of under-resourced navigators.
Health care systems in the United States are increasingly interested in measuring and addressing SDOH. Advances in electronic health record systems and Natural Language Processing (NLP) create a unique opportunity to systematically document patient SDOH from digitized free-text notes. NLP is increasingly used in health care settings to extract important individual patient information and has demonstrated early success in identifying patient housing needs, homelessness status, and social support networks.
John Snow Labs’ Healthcare Natural Language Processing (NLP) library – the most widely used tool in the healthcare and life science industries – provides a large number of models that can be used to extract SDOH information for a patient.
This webinar will introduce the details of the studies and models trained by John Snow Labs.
Gursev Pirge
Gursev Pirge is a Researcher and Senior Data Scientist with demonstrated success improving the Spark NLP for Healthcare library and delivering hands-on projects in Healthcare and Life Sciences. He has strong statistical skills and presents to all levels of leadership to improve decision making. He has experience in Education, Logistics, Data Analysis and Data Science. He has a strong education background with a Doctor of Philosophy (Ph.D.) focused on Mechanical Engineering from Bogazici University.
From GPT-4 to Llama-2: Supercharging State-of-the-Art Embeddings for Vector Databases with Spark NLP
Hallucinations pose a significant challenge when operating Large Language Models (LLM) such as GPT-4, Llama-2, or Falcon, as they can notably compromise the application’s trustworthiness. Utilizing external knowledge sources allows us to operate LLMs using any data and helps reduce hallucinations. This can be achieved by using Retrieval Augmented QA, a technique that retrieves relevant information from your own dataset and feeds it to the Large Language Model for more tailored responses.
Additionally, implementing Retrieval Augmented QA can effectively address the issues of data freshness and the use of custom datasets. This is crucial, as even some of the world’s most powerful Large Language Models, such as GPT-4 or Llama-2, are unaware of recent events or private data stored in your databases. Retrieval Augmented Generation (RAG) is a feature that empowers Large Language Models (LLMs) to generate responses using your unique data.
However, it’s important to highlight that the process of vectorizing large volumes of text to populate Vector Databases can create a bottleneck in NLP pipelines. This challenge emerges because many NLP libraries, if not all, are not built to process millions of documents effectively, particularly when using state-of-the-art embedding models like BERT, RoBERTa, DeBERTa, or any other Large Language Models used for generating text embeddings.
Join us to explore the rapidly advancing field of text embeddings and vector databases. Take a deep dive into the recently released Spark NLP 5.0, featuring advanced embedding models like INSTRUCTOR and E5. Learn how to enhance your CPU inference using ONNX and take advantage of native Cloud support to substantially boost your text vectorization process. Learn how to extend the knowledge sources of your Large Language Models (LLMs) to overcome common limitations, such as a restricted scope due to training data and an inability to incorporate new or specific datasets, including internal company documents.
In this webinar Maziyar Panahi will provide an in-depth understanding of how to exploit Spark NLP 5.0 to enhance your LLM’s efficiency, reliability, and scalability, improving your application’s overall performance. It’s an opportunity to learn practical strategies to boost retrieval for enterprise search, empowering businesses to take full advantage of technologies like OpenAI’s GPT-4 and Meta’s Llama-2 models.
Tech stack used in this webinar:
- Cloud and managed services:
  - AWS (Glue 3.0/4.0 & EMR)
  - Databricks
- Generative AI (commercial and open-source LLMs):
  - GPT-3.5 and GPT-4 by OpenAI
  - Llama-2 13B & 70B fine-tuned for chat
  - Falcon 40B fine-tuned on instructions
- Vector Database for RAG:
  - Elasticsearch (Elastic Stack)
- On-prem infrastructure:
  - HPE bare-metal server with Nvidia A100 & AMD EPYC
Maziyar Panahi
Maziyar Panahi is a Principal AI / ML engineer and a senior Team Lead with over a decade of experience in public research. He leads the team behind Spark NLP at John Snow Labs, one of the most widely used NLP libraries in the enterprise.
He develops scalable NLP components using the latest techniques in deep learning and machine learning, including classic ML, Language Models, Speech Recognition, and Computer Vision. He is an expert in designing, deploying, and maintaining ML and DL models in the JVM ecosystem and distributed computing engine (Apache Spark) at the production level.
He has extensive experience in computer networks and DevOps. He has been designing and implementing scalable solutions in Cloud platforms such as AWS, Azure, and OpenStack for the last 15 years. In the past, he also worked as a network engineer in high-level places after he completed his Microsoft and Cisco training (MCSE, MCSA, and CCNA).
He is a lecturer at The National School of Geographical Sciences teaching Big Data Platforms and Data Analytics. He is currently employed by The French National Centre for Scientific Research (CNRS) as IT Project Manager and working at the Institute of Complex Systems of Paris (ISCPIF).
Contract Understanding with Legal NLP: building a Paralegal Service with AI
Legal documents are essential for any legal profession, but the process of understanding and extracting meaningful information from these documents can be time-consuming and labor-intensive. This is where Natural Language Processing (NLP) comes into play.
In the realm of law, NLP has emerged as a transformative technology, revolutionizing the way legal professionals handle vast amounts of textual information. Legal NLP Document Understanding encompasses a range of techniques and tools designed to analyze, interpret, and extract relevant information from legal documents such as contracts, court cases, statutes, and legal opinions.
In this webinar, we will show how to use John Snow Labs’ Legal NLP capabilities, specifically Legal Classification, NER, Relation Extraction, and Question Answering, to extract and understand information from agreements.
Furthermore, we will be pleased to introduce John Snow Labs Paralegal, an innovative product designed specifically for reviewing Non-Disclosure Agreement (NDA) contracts. With this groundbreaking service, the process is incredibly straightforward: simply send a docx version of your agreement to a designated email account, and within moments, you will receive comprehensive AI-powered legal feedback. Our advanced technology meticulously analyzes the document, leveraging the vast expertise of seasoned lawyers, to identify potential concerns and highlight subtle nuances present in the agreement. This invaluable resource empowers legal professionals by providing valuable insights based on accumulated experience, ensuring thorough and meticulous scrutiny of NDAs. Say goodbye to time-consuming manual reviews and embrace the future of legal document analysis with John Snow Labs Paralegal.
Juan Martinez
Jose Juan Martinez is a Sr. Data Scientist who has worked at John Snow Labs since 2021. He has accumulated experience in developing NLP solutions for the Healthcare, Legal, and Financial domains. He now leads the efforts on the Spark NLP for Finance and Spark NLP for Legal libraries.
Deliver Safe, Fair & Robust Language Models with the NLPTest Library
As the use of Natural Language Processing (NLP) models and Large Language Models (LLMs) grows, so does the need for a comprehensive testing solution that evaluates their performance across tasks like question answering, summarization & paraphrasing, named entity recognition, and text classification. With numerous NLP libraries supported – including Spark NLP, Hugging Face Transformers, spaCy, OpenAI, and many additional LLM models and APIs – testing that your AI systems are unbiased, robust, fair, accurate, and representative is crucial.
In this webinar, Luca Martial will introduce the NLP Test Library, an innovative, open-source project developed by John Snow Labs. This powerful tool, available at no cost and installable in one line, allows users to generate and execute test cases for a variety of NLP models and libraries. The NLP Test Library not only tests your NLP pipelines, but also offers the ability to augment training data based on test results, facilitating continuous model improvement.
Join this webinar to explore how the NLP Test Library is transforming the evaluation of NLP models and learn how to harness its features to ensure your AI systems meet the highest standards of responsibility and performance. Visit nlptest.org to access this open-source tool and help our community advance towards a more responsible AI ecosystem.
Luca Martial
Luca Martial is a Senior Data Scientist at John Snow Labs, improving the Spark NLP for Healthcare library and delivering hands-on projects in Healthcare and Life Sciences. He also leads product development for the NLP Test library, an open-source responsible AI framework that ensures the delivery of safe and effective models into production.
Automated Summarization of Clinical Notes
In this webinar, Veysel will delve into the challenges of text summarization and why it matters across domains, especially in healthcare. He will cover extractive and abstractive summarization techniques, ranging from classical methods to cutting-edge approaches such as LLMs.
Veysel will then introduce you to the new clinical text summarization module in the Spark NLP for Healthcare library, which offers state-of-the-art features and capabilities, based on one of the latest LLM architectures. This module has been trained and designed to cater specifically to the needs of the healthcare domain, enabling it to summarize clinical notes with high accuracy and efficiency.
Veysel will also have a hands-on session where you will get the chance to work with clinical text summarization models, use cases, and live examples.
Veysel Kocaman
Veysel is a Lead Data Scientist and ML Engineer at John Snow Labs, improving the Spark NLP for the Healthcare library and delivering hands-on projects in Healthcare and Life Science.
He is a seasoned data scientist with a strong background in every aspect of data science including machine learning, artificial intelligence, and big data with over ten years of experience. He’s also pursuing his Ph.D. in ML at Leiden University, Netherlands, and delivers graduate-level lectures in ML and Distributed Data Processing.
Veysel has broad consulting experience in Statistics, Data Science, Software Architecture, DevOps, Machine Learning, and AI to several start-ups, boot camps, and companies around the globe. He also speaks at Data Science & AI events, conferences and workshops, and has delivered more than a hundred talks at international as well as national conferences and meetups.
Zero-Shot Visual Question Answering
Visual Question Answering is emerging as a valuable tool for NLP practitioners. New “OCR-Free” models deliver better accuracy than ever before for information extraction from forms, reports, receipts, tickets, and other document types – without requiring training or tuning. In this webinar, we explore common use cases, describe how John Snow Labs’ Visual NLP delivers it with a few lines of code, and share best practices when building practical Visual Question Answering pipelines.
Alberto Andreotti
Alberto Andreotti is a senior data scientist on the Spark NLP team at John Snow Labs, where he implements state-of-the-art NLP algorithms on top of Spark. He has a decade of experience working for companies and as a consultant, specializing in the field of machine learning.
Alberto has written lots of low-level code in C/C++ and was an early Scala enthusiast and developer. A lifelong learner, he holds degrees in engineering and computer science and is working on a third in AI.
Alberto was born in Argentina. He enjoys the outdoors, particularly hiking and camping in the mountains of Argentina.
State-Of-The-Art Medical Data De-identification and Obfuscation
The process of de-identifying protected health information (PHI) from unstructured medical notes is often essential when working with patient-level documents, such as physician notes. Using current state-of-the-art techniques, automated de-identification of both structured and free-text medical text can be accomplished at the same level of accuracy as with human experts.
Recently, John Snow Labs’ Healthcare Natural Language Processing (NLP) library – the most widely used such tool in the healthcare and life science industries – has achieved new state-of-the-art accuracy on standardized benchmarks. This webinar will introduce this solution and compare its accuracy, speed, and scalability to human efforts and to the three major cloud providers.
Join us for this webinar, where we will delve into practical implementation details and scenarios. Attendees will:
- Understand text de-identification in various human languages
- Discuss data obfuscation techniques
- Review the recommended setup for industrial-strength deployment
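The difference between masking and obfuscation can be shown with a small sketch. This is only an illustration of the two output styles, not the John Snow Labs solution: production de-identification relies on trained models rather than the handful of hypothetical regular expressions used here, and the surrogate values are arbitrary placeholders.

```python
import re

# Hypothetical, deliberately simple patterns used only for illustration.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "NAME": re.compile(r"\b(?:Dr|Mr|Mrs|Ms)\. [A-Z][a-z]+\b"),
}

# Arbitrary fixed surrogates; a real obfuscator would generate varied, realistic values.
SURROGATES = {
    "DATE": "01/01/2000",
    "PHONE": "555-000-0000",
    "NAME": "Dr. Doe",
}


def mask(text: str) -> str:
    """De-identify by replacing each detected PHI span with its entity label."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


def obfuscate(text: str) -> str:
    """De-identify by replacing each detected PHI span with a fake surrogate value."""
    for label, pattern in PHI_PATTERNS.items():
        text = pattern.sub(SURROGATES[label], text)
    return text


note = "Dr. Smith saw the patient on 03/14/2021; callback number 212-555-0187."
masked = mask(note)
obfuscated = obfuscate(note)
print(masked)
print(obfuscated)
```

Masking makes it obvious where information was removed, while obfuscation keeps the note readable and hides which values are surrogates, so any PHI that detection missed is harder to spot.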
Jiri Dobes
Jiri Dobes is the Head of Solutions at John Snow Labs. He has been leading the development of machine learning solutions in healthcare and other domains for the past five years. Jiri is a PMP-certified project manager. His previous experience includes delivering large projects in the power generation sector and consulting for the Boston Consulting Group and large pharma. Jiri holds a Ph.D. in mathematical modeling.
Veysel Kocaman
Veysel is Head of Data Science at John Snow Labs, improving the Spark NLP for the Healthcare library and delivering hands-on projects in Healthcare and Life Science.
He is a seasoned data scientist with a strong background in every aspect of data science including machine learning, artificial intelligence, and big data with over ten years of experience. He’s also pursuing his Ph.D. in ML at Leiden University, Netherlands, and delivers graduate-level lectures in ML and Distributed Data Processing.
Veysel has broad consulting experience in Statistics, Data Science, Software Architecture, DevOps, Machine Learning, and AI to several start-ups, boot camps, and companies around the globe. He also speaks at Data Science & AI events, conferences and workshops, and has delivered more than 20 talks at international as well as national conferences and meetups.
Combining Prompt Engineering, Programmatic Labelling, and Model Tuning in the No-Code NLP Lab
Data extraction from text is a day-to-day task for specialists working in verticals such as Healthcare, Finance, or Legal. NLP models are now a well-established solution with proven utility in automating such efforts. However, they require extensive data for training and tuning models to reach maximum accuracy, as well as technical knowledge to operate them. Furthermore, NLP models are not yet a commodity and do not cover out-of-the-box all extraction needs that a team might have. How can users tune models to cover the blind spots overlooked by pre-trained models?
In this webinar, we present the NLP Lab as a solution. The NLP Lab is an end-to-end, no-code platform that allows domain experts to quickly test how efficient NLP models are on custom documents, tune them for their data, and export them for production deployment. When pre-trained models are not available, NLP Lab lets you combine 3 approaches:
- Programmatic labeling – custom rules for entity extraction
- Prompt Engineering – natural language prompts for extracting custom entities and relations
- Transfer Learning – train & tune custom deep learning models when annotated data is available
Join this webinar to learn how easy to use the NLP Lab is and how quickly you can start applying it to your own documents.
Dia Trambitas
Dia Trambitas is a computer scientist with a rich background in Natural Language Processing. She leads the development of the NLP Lab, currently the best-in-class tool for text and image annotation for healthcare.
Dia holds a Ph.D. in Computer Science focused on Semantic Web and ontology-based reasoning. She has a vivid interest in text processing and data extraction from unstructured documents, a subject she has been working on for the last decade. She has broad experience delivering information extraction and data science projects across Finance, Investment Banking, Life Science, and Healthcare.
NLP for Oncology: Extracting Staging, Histology, Tumor, Biomarker, and Treatment Facts from Clinical Notes
According to the World Health Organization, cancer is the leading cause of death worldwide, accounting for nearly 10 million deaths in 2020. Considerable effort goes every day into reducing the burden of this disease, including measures to reduce people’s exposure to risk factors, screening programs for early detection, and clinical research to develop new treatments. In this context, Natural Language Processing has great potential because it can extract relevant information from oncology texts very efficiently.
In this webinar you will learn how to leverage Spark NLP for Healthcare to extract oncology-related data, such as tumor characteristics, disease stage or cancer treatments. We will walk through Python notebooks to show how to apply different types of models (Named-Entity Recognition, Assertion Status, Relation Extraction and Entity Resolution) to oncology texts.
Mauro Nievas Offidani
Mauro Nievas Offidani is a medical doctor and a data scientist who works at John Snow Labs as part of the Healthcare NLP Team. He holds an MSc Degree in Epidemiology and Healthcare Management.
Mauro has worked as a data annotator, annotation lead and a data scientist in different clinical NLP projects. His main focus of activity includes the development of oncology-specific NLP models.
Zero-Shot Learning of Healthcare NLP Models
Zero-Shot Learning (ZSL) is a new paradigm that has gained massive popularity recently due to its potential of reducing data annotations and high generalisability. In the pursuit of bringing product-ready latest ML research to our community, we have implemented ZSL for two major tasks in Spark NLP for Healthcare: Named Entity Recognition (NER) and Relation Extraction (RE).
In this session, we will explore the ZSL models available in the Spark NLP for Healthcare library, how to use them with automatic prompt generation based on Q&A models, and finally, how they perform on real data and help reduce data annotation requirements.
Hasham Ul Haq
Hasham Ul Haq is a Data Scientist at John Snow Labs, and an AI scholar and researcher at PI School of AI. During his career, he has worked on numerous projects across various sectors, including healthcare. At John Snow Labs, his primary focus is to build scalable and pragmatic NLP systems that are both production-ready and deliver SOTA performance. In particular, he has been working on Span detection, Natural Language Inference, disambiguation, Named Entity Recognition, and a lot more! Hasham also has an active research profile with publications in NeurIPS, AAAI, and multiple scholarship grants and affiliations.
Prior to John Snow Labs, he was leading search engine and knowledge base development at one of Europe’s largest telecom providers. He has also been mentoring startups in computer vision by providing training and designing ML architectures.
Automated Text Generation & Data-Augmentation for Medicine, Finance, Law, and E-Commerce
This webinar teaches you how to leverage the human-level text generation capabilities of Large Transformer models to increase the accuracy of most NLP classifiers for Medicine, Finance, and Legal datasets with the Spark NLP library.
We will also explore the next-generation capabilities for E-Commerce and creative writing to enable the creation of automated marketing text.
Additionally, you will learn and understand intuitively how the mysterious generation parameters Temperature, Top-K sampling, and Top-P (nucleus) sampling influence how tokens are drawn from word distributions, and thereby the text generated by Transformer models. Everything automates and scales effortlessly to industry-scale GPU or CPU clusters with the underlying Spark engine.
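As a rough illustration of the three generation parameters named above, the following minimal Python sketch applies temperature scaling, top-k filtering, and top-p (nucleus) filtering to a toy next-token distribution. It uses only the standard library; the function names and the example logits are assumptions made for illustration and are not part of the Spark NLP API.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature must be > 0.
    Values below 1 sharpen the distribution, values above 1 flatten it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep only the k most probable token indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most probable tokens whose cumulative
    probability reaches p (the "nucleus"), then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


def sample(dist, rng):
    """Draw one token index from a {index: probability} mapping."""
    indices = list(dist)
    weights = [dist[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]


# Toy scores for a 4-token vocabulary (hypothetical values).
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits, temperature=0.7)
rng = random.Random(0)
token = sample(top_p_filter(probs, p=0.9), rng)
```

Lower temperatures and smaller k or p values concentrate sampling on the most likely tokens, which tends to make output more predictable; larger values admit less likely tokens and increase variety.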
Christian Kasim Loan
Christian Kasim Loan is a Lead Data Scientist and Scala expert at John Snow Labs and a Computer Scientist with over a decade of experience in software. He has worked on various projects in Big Data, Data Science, and Blockchain using modern technologies such as Kubernetes, Docker, Spark, Kafka, Hadoop, Ethereum, and almost 20 programming languages to create modern cloud-agnostic AI solutions, decentralized applications, and analytical dashboards.
He has deep knowledge of Time-Series Graphs from his previous research in scalable and accurate traffic flow prediction and working on various Spatio-Temporal problems embedded in graphs at a Daimler lab.
Before his graph research, he worked on scalable meta machine learning, visual emotion extraction, and chatbots for various use cases at the Distributed Artificial Intelligence lab (DAI) in Berlin.
His most recent work includes the NLU library, which democratizes 5000+ state-of-the-art NLP models in 200+ languages in just 1 line of code for dozens of domains, with built-in visualizations and all scalable natively in Spark Clusters by its underlying Spark NLP distribution engine.
Text classification and named entity recognition with BertForTokenClassification & BertForSequenceClassification
Recognizing entities is a fundamental step towards understanding unstructured data in documents. Spark NLP includes state-of-the-art BERT-based models for token classification and sequence classification.
This session will cover the background and motivation behind these models, practical code implementations and introduce you to some of the pretrained models available in Spark NLP and its licensed counterpart, Spark NLP for Healthcare.
Luca Martial
Luca Martial is a Data Scientist at John Snow Labs. In this role, he has been building custom NLP solutions to showcase John Snow Labs’ healthcare library capabilities to customers, and training Spark NLP models for named entity recognition, relation extraction, text classification, de-identification and clinical entity resolution of medical notes and reports.
Zero Shot Learning for Semantic Relation Extraction from Unstructured Text
Relation Extraction is one of the most important tasks of NLP applications in healthcare, but it is expensive: competent people must be found to label the data needed to train the models. With the Zero-Shot Learning method, which has recently been adopted in the field of NLP, it has become possible to train Relation Extraction models without the need for data labeling. In this presentation, we will explain how to use the Zero-Shot Learning method for Relation Extraction in unstructured texts.
Muhammet Santas
Muhammet Santas has a Master’s Degree St. in Artificial Intelligence and works as a Data Scientist at John Snow Labs as part of the Healthcare NLP Team.
Building Real-World Healthcare AI Projects from Concept to Production
In this Webinar, Juan Martinez from John Snow Labs and Ken Puffer from ePlus will share lessons learned from recent AI, ML, and NLP projects that have been successfully built & deployed in US hospital systems:
- Improving patient flow forecasting at Kaiser Permanente
- A real-time clinical decision support platform for Psychiatry and Oncology at Mount Sinai
- Automated de-identification of 700 million patient notes at Providence Health
Then they will showcase a live demo of the recently launched AI Workflow Accelerator Bundle for Healthcare, which provides a complete data science platform supporting the full AI lifecycle:
- Data analysis: Enable data analysts to query, visualize & build dashboards without coding
- Data science: Enable data scientists to train models, share & scale experiments
- Model deployment options
- Operations: Enable DevOps & DataOps engineers to monitor, secure, and scale
The bundle is a turnkey solution composed of GPU-accelerated hardware from NVIDIA, proprietary software from John Snow Labs, and implementation services from ePlus. It is unique in providing all of the following healthcare-specific capabilities out of the box:
- 2,300+ current, clean, and enriched healthcare datasets – from ontologies to benchmarks
- Spark NLP for Healthcare – the most widely used NLP library in the healthcare industry – along with 250+ pre-trained clinical & biomedical NLP models for analyzing unstructured data
- Spark OCR – including the ability to read, de-identify, and extract information from DICOM images
- Security controls implemented within the platform, to enable a team of data scientists to effectively work & collaborate in air-gap, high-compliance environments
We will share speed & accuracy benchmarks measuring the optimization of John Snow Labs software and models on the GPU-accelerated Nvidia hardware – and how this translates to enabling your AI team to deliver bigger projects faster.
Juan Martinez
Juan Martinez is a Sr. Data Scientist who has been working at John Snow Labs since 2021. He graduated in Computer Engineering in 2006, and since then his main focus has been the application of Artificial Intelligence to texts and unstructured data. To better understand the intersection between Language and AI, he complemented his technical background with a Linguistics degree from Moscow Pushkin State Language Institute in 2012 and later at the University of Alcala (2014).
He is part of the Healthcare Data Science team at John Snow Labs. His main activities are training and evaluation of Deep Learning, Semantic and Symbolic models within the Healthcare domain, benchmarking, research and team coordination tasks. His other areas of interest are Machine Learning operations and Infrastructure.
Ken Puffer
Ken Puffer is the Chief Technology Officer for Healthcare solutions at ePlus. In this role, Ken consults with a broad range of healthcare leaders and technology partners to help ePlus develop, deploy, optimize, and maintain solutions that help solve the unique challenges facing healthcare.
Deeper Clinical Document Understanding Using Relation Extraction
Recognizing entities is a fundamental step towards understanding a piece of text – but entities alone only tell half the story. The other half comes from explaining the relationships between entities. Spark NLP for Healthcare includes state-of-the-art (SOTA) deep learning models that address this issue by semantically relating entities in unstructured data.
John Snow Labs has developed multiple models utilizing BERT architectures with custom feature generation to achieve peer-reviewed SOTA accuracy on multiple benchmark datasets. This session will shed light on the background and motivation behind relation extraction, techniques, real-world use cases, and practical code implementation.
Hasham Ul Haq
Hasham Ul Haq is a Data Scientist at John Snow Labs, and an AI scholar and researcher at PI School of AI. During his career, he has worked on numerous projects across various sectors, including healthcare. At John Snow Labs, his primary focus is to build scalable and pragmatic NLP systems that are both production-ready and deliver SOTA performance. In particular, he has been working on Natural Language Inference, disambiguation, Named Entity Recognition, and a lot more! Hasham also has an active research profile with publications in NeurIPS, AAAI, and multiple scholarship grants and affiliations.
Prior to John Snow Labs, he was leading search engine and knowledge base development at one of Europe’s largest telecom providers. He has also been mentoring startups in computer vision by providing trainings and designing ML architectures.
Rule-Based and Pattern Matching for Entity Recognition in Spark NLP
Finding patterns and matching strategies are well-known NLP procedures to extract information from text.
The Spark NLP library has two annotators that can use these techniques to extract relevant information or recognize entities of interest in large-scale environments when dealing with lots of documents from medical records, web pages, or data gathered from social media.
In this talk, we will see how to retrieve the information we are looking for by using the following annotators:
- Entity Ruler, an annotator available in open-source Spark NLP.
- Contextual Parser, an annotator available only in Spark NLP for Healthcare.
- In addition, we will enumerate use cases where we can apply these annotators.
After this webinar, you will know when to use a rule approach to extract information from your data and the best way to set the available parameters in these annotators.
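To make the idea of a rule approach concrete, here is a minimal Python sketch of pattern-based entity extraction using only the standard `re` module. It is not the Entity Ruler or Contextual Parser API; the rule set, labels, and sample sentence are hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical rules: entity label -> regular expression.
RULES = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "DOSAGE": re.compile(r"\b\d+(?:\.\d+)?\s?(?:mg|ml|mcg)\b", re.IGNORECASE),
}


def match_entities(text, rules=RULES):
    """Return (start, end, label, matched_text) tuples sorted by position."""
    entities = []
    for label, pattern in rules.items():
        for m in pattern.finditer(text):
            entities.append((m.start(), m.end(), label, m.group()))
    return sorted(entities)


text = "Patient received 500 mg amoxicillin on 2021-03-04."
for start, end, label, value in match_entities(text):
    print(f"{label}: {value!r} at [{start}, {end})")
```

Rules of this kind work well when entities follow a predictable surface form, such as dates, dosages, or identifiers, and no annotated training data is needed.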
Danilo Burbano
Danilo Burbano is a Software and Machine Learning Engineer at John Snow Labs. He holds an MSc in Computer Science and has 12 years of commercial experience.
He has previously developed several software solutions over distributed system environments like microservices and big data pipelines across different industries and countries. Danilo has contributed to Spark NLP for the last 4 years. He is now working to maintain and evolve the Spark NLP library by continuously adding state-of-the-art NLP tools to allow the community to implement and deploy cutting-edge large-scale projects for AI and NLP.
Automating Clinical Trial Master File Migration & Information Extraction
Pharmaceutical companies that conduct clinical trials, looking to get new treatments to market as quickly as possible, possess a high volume of documents. Millions of documents can be created as part of one trial and are stored in a document management system. When these documents need to be migrated to a new system – for example, when a pharma company acquires the rights to a drug or trial – they must often all be read manually in order to classify them and extract metadata that is legally required and must be accurate. Traditionally, this migration is a long, complex, and labor-intensive process.
We present a solution based on a natural language processing (NLP) system that provides:
- Speed – 80% reduction of manual labor and migration timeline, proven in major real-world projects
- State of the art accuracy – based on Spark NLP for Healthcare, integrated in a human-in-the-loop solution
- End-to-end, secure and compliant solution – Air-gap deployment, GxP and GAMP 5 validated
We will share lessons learned from an end-to-end migration of the trial master file at Novartis.
Jiri Dobes
Jiri Dobes is the Head of Solutions at John Snow Labs. He has been leading the development of machine learning solutions in healthcare and other domains for the past five years. Jiri is a PMP-certified project manager. His previous experience includes delivering large projects in the power generation sector and consulting for the Boston Consulting Group and large pharma. Jiri holds a Ph.D. in mathematical modeling.
Enterprise-Scale Data Labeling & Automated Model Training with the Free Annotation Lab
Extracting data from unstructured documents is a common requirement – from finance and insurance to pharma and healthcare. Recent advances in deep learning offer impressive results on this task when models are trained on large enough datasets.
However, getting high-quality data involves a lot of manual effort. An annotation project is defined, annotation guidelines are specified, documents are imported, tasks are distributed among domain experts, a manager tracks the team’s performance, inter-annotator agreement is reached, and the resulting annotations are exported into a standard format. At enterprise-scale, complexity grows due to the volume of projects, tasks, and users.
John Snow Labs’ Annotation Lab is a free annotation tool that has already been deployed and used by large-scale enterprises for three years. This webinar presents how you can exploit the tool’s capabilities to easily manage any annotation project – from small team to enterprise-wide. It also shows how models can be trained automatically, without writing a single line of code, and how any pre-trained model can be used to pre-annotate documents to speed up projects by 5x – since domain experts don’t start annotating from scratch but correct and improve the models, as part of a no-code human-in-the-loop AI workflow.
Nabin Khadka
Nabin Khadka leads the team building the Annotation Lab at John Snow Labs. He has 7 years of experience as a software engineer, covering a broad range of technologies from web & mobile apps to distributed systems and large-scale machine learning.
Creating a Clinical Knowledge Graph with Spark NLP and Neo4j
A knowledge graph represents a collection of connected entities and their relations. A knowledge graph fueled by machine learning uses natural language processing to construct a comprehensive, semantic view of the entities. A complete knowledge graph allows question answering and search systems to retrieve answers to given queries. In this study, we built a knowledge graph using Spark NLP models and Neo4j. The marriage of Spark NLP and Neo4j is very promising for creating clinical knowledge graphs that support deeper analysis, Q&A tasks, and insight generation.
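The notion of connected entities and relations can be sketched in a few lines of plain Python. The example below stores (subject, relation, object) triples in an adjacency map and answers a simple query; the triples, relation names, and helper functions are hypothetical, and the sketch uses neither Spark NLP nor Neo4j.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (subject, relation, object) triples by subject."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def query(graph, subject, relation):
    """Return every object linked to `subject` by `relation`."""
    return [obj for rel, obj in graph.get(subject, []) if rel == relation]


# Hypothetical triples, e.g. as produced by NER plus relation extraction.
triples = [
    ("patient_1", "HAS_DIAGNOSIS", "type 2 diabetes"),
    ("patient_1", "TAKES_DRUG", "metformin"),
    ("metformin", "TREATS", "type 2 diabetes"),
]
graph = build_graph(triples)
print(query(graph, "patient_1", "TAKES_DRUG"))
```

In a real pipeline, a graph database would take the place of the in-memory map, but the data model of entities as nodes and relations as labeled edges stays the same.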
Ali Emre Varol
Ali Emre Varol is a data scientist working on Spark NLP for Healthcare at John Snow Labs with a decade of industry experience. He has previously worked as a software engineer to develop ERP solutions and led teams and projects building machine learning solutions in a variety of industries. He is also pursuing his Ph.D. in Industrial Engineering at Middle East Technical University and holds an MS degree in Industrial Engineering.
1 Line of Code to Use 200+ State-of-the-Art Clinical & Biomedical NLP Models
In this Webinar, Christian Kasim Loan will teach you how to leverage, in 1 line of code, hundreds of State-of-the-Art models for various Medical and Healthcare domains. These include Named Entity Recognition (NER) for Adverse Drug Events, Anatomy, Diseases, Chemicals, Clinical Events, Human Phenotypes, Posology, Radiology, Measurements, and many other fields, plus best-in-class resolution algorithms to map the extracted entities into medical code terminologies like ICD10, ICD-O, RXNORM, SNOMED, LOINC, and many more.
Additionally, we will showcase how to extract the relationships between predicted entities for the Posology, Drug Adverse Effects, Temporal Features, Body Part problems, and Procedures domains, and how to De-Identify your text documents.
Finally, we will take a look at the latest NLU Streamlit features and how you can leverage them to visualize all model predictions and test them out with 0 lines of code in your web browser!
Christian Kasim Loan
Data Scientist and Spark/Scala ML engineer
Accurate Table Extraction from Documents & Images with Spark OCR
Extracting data formatted as a table (tabular data) is a common task — whether you’re analyzing financial statements, academic research papers, or clinical trial documentation. Table-based information varies heavily in appearance, fonts, borders, and layouts. This makes the data extraction task challenging even when the text is searchable – but more so when the table is only available as an image.
This webinar presents how Spark OCR automatically extracts tabular data from images. This end-to-end solution includes computer vision models for table detection and table structure recognition, as well as OCR models for extracting text & numbers from each cell. The implemented approach provides state-of-the-art accuracy for the ICDAR 2013 and TableBank benchmark datasets.
Mykola Melnyk
Mykola Melnyk is a senior Scala, Python, and Spark software engineer with 15 years of industry experience. He has led teams and projects building machine learning and big data solutions in a variety of industries – and is currently the lead developer of the Spark OCR library at John Snow Labs.
Speed Optimization & Benchmarks in Spark NLP 3: Making the Most of Modern Hardware
Spark NLP is the most widely used NLP library in the enterprise, thanks to implementing production-grade, trainable, and scalable versions of state-of-the-art deep learning & transfer learning NLP research. It is also Open Source with a permissive Apache 2.0 license that officially supports Python, Java, and Scala languages backed by a highly active community and JSL members.
Spark NLP library implements core NLP algorithms including lemmatization, part of speech tagging, dependency parsing, named entity recognition, spell checking, multi-class and multi-label text classification, sentiment analysis, emotion detection, unsupervised keyword extraction, and state-of-the-art Transformers such as BERT, ELECTRA, ELMO, ALBERT, XLNet, and Universal Sentence Encoder.
The latest release of Spark NLP 3.0 comes with 1,100+ pretrained models, pipelines, and Transformers in 190+ different languages. It also delivers massive speed-ups on both CPU & GPU devices while extending support for the latest computing platforms such as new Databricks runtimes and EMR versions.
The talk will focus on how to scale Apache Spark / PySpark applications in YARN clusters, use GPU in Databricks new Apache Spark 3.x runtimes, and manage large-scale datasets in resource-demanding NLP applications efficiently. We will share benchmarks, tips & tricks, and lessons learned when scaling Spark NLP.
Maziyar Panahi
Maziyar Panahi is a Senior Data Scientist and Spark NLP Lead at John Snow Labs with over a decade of experience in public research. He is a senior Big Data engineer and a Cloud architect with extensive experience in computer networks and software engineering. He has been developing software and planning networks for the last 15 years. In the past, he also worked as a network engineer in high-level places after he completed his Microsoft and Cisco training (MCSE, MCSA, and CCNA).
He has been designing and implementing large-scale databases and real-time Web services in public and private Clouds such as AWS, Azure, and OpenStack for the past decade. He is one of the early adopters and main maintainers of the Spark NLP library. He is currently employed by The French National Centre for Scientific Research (CNRS) as a Big Data engineer and System/Network Administrator working at the Institute of Complex Systems of Paris (ISCPIF).
Visual Document Understanding with Multi-Modal Image & Text Mining in Spark OCR 3
The Transformer architecture in NLP has truly changed the way we analyze text. NLP models are great at processing digital text, but many real-world applications use documents with more complex formats. For example, healthcare systems often include visual lab results, sequencing reports, clinical trial forms, and other scanned documents. When we only use an NLP approach for document understanding, we lose layout and style information – which can be vital for document image understanding. New advances in multi-modal learning allow models to learn from both the text in documents (via NLP) and visual layout (via computer vision).
We provide multi-modal visual document understanding, built on Spark OCR based on the LayoutLM architecture. It achieves new state-of-the-art accuracy in several downstream tasks, including form understanding (from 70.7 to 79.3), receipt understanding (from 94.0 to 95.2) and document image classification (from 93.1 to 94.4).
Mykola Melnyk
Mykola Melnyk is a senior Scala, Python, and Spark software engineer with 15 years of industry experience. He has led teams and projects building machine learning and big data solutions in a variety of industries – and is currently the lead developer of the Spark OCR library at John Snow Labs.
Using & Expanding the NLP Models Hub
The NLP Models Hub which powers the Spark NLP and NLU libraries takes a different approach than the hubs of other libraries like TensorFlow, PyTorch, and Hugging Face. While it also provides an easy-to-use interface to find, understand, and reuse pre-trained models, it focuses on providing production-grade state-of-the-art models for each NLP task instead of a comprehensive archive.
This implies a higher quality bar for accepting community contributions to the NLP Models Hub – in terms of automated testing, level of documentation, and transparency of accuracy metrics and training datasets. This webinar shows how you can make the most of it, whether you’re looking to easily reuse models or contribute new ones.
Dia Trambitas
Dia Trambitas is a computer scientist with a rich background in Natural Language Processing. She has a Ph.D. in Semantic Web from the University of Grenoble, France, where she worked on ways of describing spatial and temporal data using OWL ontologies and reasoning based on semantic annotations. She then changed her interest to text processing and data extraction from unstructured documents, a subject she has been working on for the last 10 years. She has a rich experience working with different annotation tools and leading document classification and NER extraction projects in verticals such as Finance, Investment, Banking, and Healthcare.
State-of-the-art Natural Language Processing for 200+ Languages with 1 Line of code
Learn to harness the power of 1,000+ production-grade & scalable NLP models for 200+ languages – all available with just 1 line of Python code by leveraging the open-source NLU library, which is powered by the widely popular Spark NLP.
John Snow Labs has delivered over 80 releases of Spark NLP to date, making it the most widely used NLP library in the enterprise and providing the AI community with state-of-the-art accuracy and scale for a variety of common NLP tasks. The most recent releases include pre-trained models for over 200 languages – including languages that do not use spaces for word segmentation, like Chinese, Japanese, and Korean, and languages written from right to left, like Arabic, Farsi, Urdu, and Hebrew. All software and models are free and open source under an Apache 2.0 license.
This webinar will show you how to leverage the multi-lingual capabilities of Spark NLP & NLU – including automated language detection for up to 375 languages, and the ability to perform translation, named entity recognition, stopword removal, lemmatization, and more in a variety of language families. We will create Python code in real-time and solve these problems in just 30 minutes. The notebooks will then be made freely available online.
Christian Kasim Loan
Data Scientist and Spark/Scala ML engineer
Automated Drug Adverse Event Detection from Unstructured Text
Adverse Drug Events (ADEs) are potentially very dangerous to patients and are amongst the top causes of morbidity and mortality. Monitoring & reporting of ADEs is required by pharma companies and healthcare providers. This session introduces new state-of-the-art deep learning models for automatically detecting if a free-text paragraph includes an ADE (document classification), as well as extracting the key terms of the event in structured form (named entity recognition). Using live Python notebooks and real examples from clinical and conversational text, we’ll show how to apply these models using the Spark NLP for Healthcare library.
Julio Bonis
Julio Bonis is a data scientist working on Spark NLP for Healthcare at John Snow Labs. Julio has broad experience in software development and design of complex data products within the scope of Real World Evidence (RWE) and Natural Language Processing (NLP). He also has substantial clinical and management experience – including entrepreneurship and Medical Affairs. Julio is a medical doctor specialized in Family Medicine (registered GP), has an Executive MBA – IESE, an MSc in Bioinformatics, and an MSc in Epidemiology.
John Snow Labs NLU: Become a Data Science Superhero with One Line of Python code
Learn how to unleash the power of 350+ pre-trained NLP models, 100+ Word Embeddings, 50+ Sentence Embeddings, and 50+ Classifiers in 46 languages with 1 line of Python code. John Snow Labs’ new NLU library marries the power of Spark NLP with the simplicity of Python. Tackle NLP tasks like NER, POS, Emotion Analysis, Keyword extraction, Question answering, Sarcasm Detection, Document classification using state-of-the-art techniques. The end-to-end library includes word & sentence embeddings like BERT, ELMO, ALBERT, XLNET, ELECTRA, USE, Small-BERT, and others; text wrangling and cleaning like tokenization, chunking, lemmatizing, stemming, normalizing, spell-checking, and matchers; and easy visualization capabilities using your embedded data with T-SNE.
Christian Kasim Loan, the creator of NLU, will walk through NLU and show you how easy it is to generate T-SNE visualizations of 6 Deep Learning Embeddings, achieve top classification results on text problems from Kaggle competition with 1 line of NLU code, and leverage the latest & greatest advances in deep learning & transfer learning.
Christian Kasim Loan
Data Scientist and Spark/Scala ML engineer
Answering natural language questions
The ability to directly answer medical questions asked in natural language either about a single entity (“what drugs has this patient been prescribed?”) or a set of entities (“list stage 4 lung cancer patients with no history of smoking”) has been a longstanding industry goal, given its broad applicability across many use cases.
This webinar presents a software solution, based on state-of-the-art deep learning and transfer learning research, for translating natural language questions to SQL statements. The case study is a system that answers clinical questions by training domain-specific models and learning from reference data. This is a production-grade, trainable, and scalable capability of Spark NLP Enterprise. Live notebooks will be shared to explain how you can use it in your own projects.
Prabod Rathnayaka
Graduate Research Assistant and PhD Student at La Trobe University
Accurate de-identification, obfuscation, and editing of scanned medical documents and images
One kind of noisy data that healthcare data scientists deal with is scanned documents and images: from PDF attachments of lab results, referrals, or genetic testing to DICOM files with medical imaging. These files are challenging to de-identify, because personal health information (PHI) can appear anywhere in free text – so cannot be removed with rules or regular expressions – or “burned” into images so that it’s not even available as digital text to begin with.
This webinar presents a software system that tackles these challenges, with lessons learned from applying it in real-world production systems. The workflow uses:
- Spark OCR to extract both digital and scanned text from PDF and DICOM files
- Spark NLP for Healthcare to recognize sensitive data in the extracted free text
- The de-identification module to delete, replace, or obfuscate PHI
- Spark OCR to generate new PDF or DICOM file with the de-identified data
- Run the whole workflow within a local secure environment, with no need to share data with any third party or a public cloud API
Dr. Alina Petukhova
Data Scientist at John Snow Labs
Hardening a Cleanroom AI Platform to allow model training & inference on Protected Health Information
Artificial intelligence projects in high-compliance industries, like healthcare and life science, often require processing Protected Health Information (PHI). This may happen because the nature of the projects does not allow full de-identification in advance – for example, when dealing with rare diseases, genetic sequencing data, identity theft, or training de-identification models – or when training is done on anonymized data but inference must happen on data with PHI.
In such scenarios, the alternative is to create an “AI cleanroom” – an isolated, hardened, air-gap environment where the work happens. Such a software platform should enable data scientists to log into the cleanroom, and do all the development work inside it – from initial data exploration & experimentation to model deployment & operations – while no data, computation, or generated assets ever leave the cleanroom.
This webinar first presents the architecture of such a Cleanroom AI Platform, which has been actively used by Fortune 500 companies for the past three years. Second, it will survey the hundreds of DevOps & SecOps features required to realize such a platform – from multi-factor authentication and point-to-point encryption to vulnerability scanning and network isolation. Third, it will explain how a Kubernetes-based architecture enables “Cleanroom AI” without giving up on the main benefits of cloud computing: elasticity, scalability, turnkey deployment, and a fully managed environment.
Ali Naqvi
Ali Naqvi is the lead product manager of the AI Platform at John Snow Labs. Ali has extensive experience building end-to-end data science platforms & solutions for the healthcare and life science industries, using modern technology stacks such as Kubernetes, TensorFlow, Spark, MLflow, Elastic, NiFi, and related tools. Ali has a Master’s degree in Molecular Science and over a decade of hands-on experience in software engineering and academic research.
Maximizing Text Recognition Accuracy with Image Transformers in Spark OCR
Spark OCR is an optical character recognition library that can scale natively on any Spark cluster; enables processing documents privately without uploading them to a cloud service; and most importantly, provides state-of-the-art accuracy for a variety of common use cases. A primary method of maximizing accuracy is using a set of pre-built image pre-processing transformers – for noise reduction, skew correction, object removal, automated scaling, erosion, binarization, and dilation. These transformers can be combined into OCR pipelines that effectively resolve common ‘document noise’ issues that reduce OCR accuracy.
This webinar describes real-world OCR use cases, common accuracy issues they bring, and how to use image transformers in Spark OCR in order to resolve them at scale. Example Python code will be shared using executable notebooks that will be made publicly available.
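To make the pre-processing steps named above concrete, here is a minimal pure-Python sketch of binarization followed by erosion and dilation on a tiny grayscale grid. It illustrates the general image-processing operations only, under the assumption of a 3x3 neighborhood clipped at the image border; the function names are hypothetical and this is not the Spark OCR transformer API.

```python
def binarize(gray, threshold=128):
    """Mark a pixel as ink (1) if it is darker than the threshold, else 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def _morph(img, combine):
    """Apply `combine` (min or max) over each pixel's 3x3 neighborhood.

    The neighborhood is clipped at the image border.
    """
    height, width = len(img), len(img[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                img[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(combine(window))
        out.append(row)
    return out


def erode(img):
    """Keep ink only where the whole neighborhood is ink."""
    return _morph(img, min)


def dilate(img):
    """Grow ink into any pixel that has an ink neighbor."""
    return _morph(img, max)


# A 6x6 white page with a 3x3 block of ink and one isolated noise speck.
gray = [[255] * 6 for _ in range(6)]
for y in range(1, 4):
    for x in range(1, 4):
        gray[y][x] = 0
gray[5][5] = 0  # the speck

binary = binarize(gray)
cleaned = dilate(erode(binary))  # erosion then dilation removes the speck
for row in cleaned:
    print("".join("#" if px else "." for px in row))
```

Chaining erosion and dilation in this order removes specks smaller than the neighborhood while restoring larger shapes, which is one way such steps can be combined to reduce document noise before recognition.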
Mykola Melnyk
Mykola Melnyk is a senior Scala, Python, and Spark software engineer with 15 years of industry experience. He has led teams and projects building machine learning and big data solutions in a variety of industries – and is currently the lead developer of the Spark OCR library at John Snow Labs.
Best Practices & Tools for Accurate Document Annotation and Data Abstraction
Are you working on machine learning tasks such as sentiment analysis, named entity recognition, text classification, image classification or audio segmentation? If so, you need training data adapted for your particular domain and task.
This webinar will explain the best practices and strategies for getting the training data you need. We will go over the setup of the annotation team, the workflows that need to be in place for guaranteeing high accuracy and labeler agreement, and the tools that will help you increase productivity and eliminate errors.
Dia Trambitas
Dia Trambitas is a computer scientist with a rich background in Natural Language Processing. She has a Ph.D. in Semantic Web from the University of Grenoble, France, where she worked on ways of describing spatial and temporal data using OWL ontologies and reasoning based on semantic annotations. She then shifted her focus to text processing and data extraction from unstructured documents, a subject she has been working on for the last 10 years. She has rich experience working with different annotation tools and leading document classification and NER extraction projects in verticals such as Finance, Investment, Banking, and Healthcare.
Automated Mapping of Clinical Entities from Natural Language Text to Medical Terminologies
The immense variety of terms, jargon, and acronyms used in medical documents means that named entity recognition of diseases, drugs, procedures, and other clinical entities isn’t enough for most real-world healthcare AI applications. For example, knowing that “renal insufficiency”, “decreased renal function” and “renal failure” should be mapped to the same code, before using that code as a feature in a patient risk prediction or clinical guidelines recommendation model, is critical to that model’s accuracy. Without it, the training algorithm will see these three terms as three separate features and will severely under-estimate the relevance of this condition.
This need for entity resolution, also known as entity normalization, is therefore a key requirement of a healthcare NLP library. This webinar explains how Spark NLP for Healthcare addresses this issue by providing trainable, deep-learning-based, clinical entity resolution, as well as pre-trained models for the most commonly used medical terminologies: SNOMED-CT, RxNorm, ICD-10-CM, ICD-10-PCS, and CPT.
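The effect of resolution on downstream features can be shown with a small sketch. It uses a hand-written lookup table as a stand-in for a trained resolver, and the concept identifier `CONCEPT_RENAL_IMPAIRMENT` is a made-up placeholder rather than a real SNOMED-CT or ICD-10-CM code.

```python
from collections import Counter

# Hypothetical lookup table standing in for a trained entity resolver.
# The concept identifier is a placeholder, not a real terminology code.
TERM_TO_CONCEPT = {
    "renal insufficiency": "CONCEPT_RENAL_IMPAIRMENT",
    "decreased renal function": "CONCEPT_RENAL_IMPAIRMENT",
    "renal failure": "CONCEPT_RENAL_IMPAIRMENT",
}


def resolve(mention: str) -> str:
    """Map a mention to its concept; fall back to the normalized text."""
    key = mention.strip().lower()
    return TERM_TO_CONCEPT.get(key, key)


mentions = [
    "Renal insufficiency",
    "decreased renal function",
    "renal failure",
    "renal failure",
]

# Without resolution, the same condition is split across three features.
raw_features = Counter(m.strip().lower() for m in mentions)

# With resolution, all four mentions count toward a single feature.
resolved_features = Counter(resolve(m) for m in mentions)

print(raw_features)
print(resolved_features)
```

A static dictionary cannot cover the variety of phrasing found in clinical text, which is why the approach described above relies on trainable models rather than a fixed list.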
Andres Fernandez
Andrés Fernández is a Machine Learning Engineer and Data Scientist at John Snow Labs with 10 years of experience in the Finance, Retail and Healthcare industries.
After his MSc in Software Engineering at the University of Málaga, he has been helping Latin American and USA companies conceptualize, design and build AI solutions to automate their operations in functions like Insurance Claims, Pricing, Retail Procurement, Marketing, and others. Andrés has dedicated the last 5 years to real-world applications of Natural Language Processing, focusing mainly on Log Processing, Text Clustering, and Entity Resolution.
AI Model Governance in a High-Compliance Industry
Model governance defines a collection of best practices for data science – versioning, reproducibility, experiment tracking, automated CI/CD, and others. Within a high-compliance setting where the data used for training or inference contains protected health information (PHI) or similarly sensitive data, additional requirements apply, such as strong identity management, role-based access control, approval workflows, and a full audit trail.
This webinar summarizes requirements and best practices for establishing a high-productivity data science team within a high-compliance environment. It then demonstrates how these requirements can be met using John Snow Labs’ Healthcare AI Platform.
Ali Naqvi
Ali Naqvi is the lead product manager of the AI Platform at John Snow Labs. Ali has extensive experience building end-to-end data science platforms & solutions for the healthcare and life science industries, using modern technology stacks such as Kubernetes, TensorFlow, Spark, MLflow, Elastic, NiFi, and related tools. Ali has a Master’s degree in Molecular Science and over a decade of hands-on experience in software engineering and academic research.
Accurate De-Identification of Structured & Unstructured Medical Data at Scale
Recent advances in deep learning enable automated de-identification of medical data to approach the accuracy achievable via manual effort. This includes accurate detection & obfuscation of patient names, doctor names, locations, organizations, and dates from unstructured documents – or accurate detection of column names & values in structured tables. This webinar explains:
- What’s required to de-identify medical records under the US HIPAA privacy rule
- Typical de-identification use cases, for structured and unstructured data
- How to implement de-identification of these use cases using Spark NLP for Healthcare
After the webinar, you will understand how to de-identify data automatically, accurately, and at scale, for the most common scenarios.
Julio Bonis
Julio Bonis is a data scientist working on Spark NLP for Healthcare at John Snow Labs. Julio has broad experience in software development and design of complex data products within the scope of Real World Evidence (RWE) and Natural Language Processing (NLP). He also has substantial clinical and management experience – including entrepreneurship and Medical Affairs. Julio is a medical doctor specialized in Family Medicine (registered GP), has an Executive MBA – IESE, an MSc in Bioinformatics, and an MSc in Epidemiology.
State-of-the-art named entity recognition with BERT
Deep neural network models have recently achieved state-of-the-art performance gains in a variety of natural language processing (NLP) tasks. However, these gains rely on the availability of large amounts of annotated examples, without which state-of-the-art performance is rarely achievable. This is especially inconvenient for the many NLP fields where annotated examples are scarce, such as medical text.
Named entity recognition (NER) is one of the most important tasks for the development of more sophisticated NLP systems. In this webinar, we will walk you through how to train a custom NER model using BERT embeddings in Spark NLP – taking advantage of transfer learning to greatly reduce the amount of annotated text needed to achieve accurate results. After the webinar, you will be able to train your own NER models with your own data in Spark NLP.
Veysel Kocaman
Veysel Kocaman is a Senior Data Scientist and ML Engineer at John Snow Labs and has a decade of industry experience. He is also pursuing his PhD in CS and lecturing at Leiden University (NL), and holds an MS degree in Operations Research from Penn State University. He is affiliated with Google as a Developer Expert in Machine Learning.