Award-winning AI and NLP company John Snow Labs has been invited to present at O’Reilly AI for the third year in a row. This year the company will jointly present one of its most impactful case studies, on using NLP to interpret medical records.
“Interpreting millions of patient stories with deep learned OCR and NLP” will be delivered by Alberto Andreotti, a data scientist at John Snow Labs, and Stacy Ashworth, Chief Clinical Officer at SelectData. The talk describes how John Snow Labs’ state-of-the-art Spark NLP for Healthcare extracts high-quality facts from medical records accurately and at scale.
Many businesses still depend on documents stored as images, from receipts, manifests, invoices, medical reports, and ID cards snapped with mobile phone cameras to contracts, waivers, leases, forms, and audit records digitized with scanners. Extracting high-quality data from these images comes with three challenges. The first is OCR: think of a crumpled receipt photographed from an angle in a dimly lit room. The second is NLP: extracting normalized values and entities from the natural language text. The third is building predictors or recommenders that suggest the best next action and, in particular, can deal with missing, wrong, or conflicting information generated by the previous steps.
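To make the third challenge concrete, here is a minimal Python sketch of one way a downstream step might reconcile missing, wrong, or conflicting facts before a predictor sees them. It is an illustration only, not the system described in the talk: the `Fact` type, the `resolve` function, the field names, the confidence threshold, and the sample values are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    field: str         # hypothetical field name, e.g. "diagnosis"
    value: str         # normalized value produced by the NLP step
    confidence: float  # assumed combined OCR/NLP confidence in [0, 1]


def resolve(facts, required_fields, min_confidence=0.5):
    """Reduce noisy extracted facts to one value per required field.

    Returns {field: (value_or_None, status)} where status is
    "ok", "conflict", or "missing".
    """
    by_field = defaultdict(list)
    for fact in facts:
        if fact.confidence >= min_confidence:  # drop likely-wrong extractions
            by_field[fact.field].append(fact)

    resolved = {}
    for field in required_fields:
        candidates = by_field.get(field, [])
        if not candidates:
            resolved[field] = (None, "missing")
            continue
        # Keep the most confident value, but flag disagreement for review.
        best = max(candidates, key=lambda f: f.confidence)
        distinct_values = {f.value for f in candidates}
        status = "ok" if len(distinct_values) == 1 else "conflict"
        resolved[field] = (best.value, status)
    return resolved


# Made-up sample facts, for illustration only.
facts = [
    Fact("diagnosis", "type 2 diabetes", 0.92),
    Fact("diagnosis", "type 1 diabetes", 0.61),
    Fact("medication", "metformin", 0.88),
    Fact("allergy", "penicilin", 0.30),  # low confidence, filtered out
]
result = resolve(facts, ["diagnosis", "medication", "allergy"])
```

Carrying an explicit status alongside each value lets the prediction step treat a conflicting or missing field differently from a clean one instead of silently trusting it.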
The good news is that state-of-the-art deep learning techniques, now available as open source software, can approach human accuracy in these three tasks, and do so at scale. Stacy Ashworth and Alberto Andreotti explore a case study of an AI system that reads millions of pages of patient information gathered from hundreds of sources, which results in a great variety of image formats, templates, and quality levels. They explore the solution architecture and key lessons learned in going from raw images to a deployed predictive workflow based on facts extracted from the scanned documents.
The talk will introduce Spark NLP for Healthcare, a natively distributable, deep learning-based library, and its OCR capabilities. The OCR library employs adaptive scaling, rotation, and erosion to achieve a significant accuracy boost compared to Tesseract. Spark NLP applies techniques such as BERT embeddings, trainable pipelines, and DL-based sentence segmentation and spell checking that materially improve accuracy when mining OCR-sourced text. Since both libraries are native extensions of Apache Spark, a unified pipeline can be written in Python, Java, or Scala for all three stages (including ML based on the results of OCR and NLP), enabling a new level of scale, speed, and reproducibility for the entire pipeline from image to next-best action.
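For readers unfamiliar with erosion as an image preprocessing step, the following pure-Python sketch shows generic binary morphological erosion with a 3x3 structuring element. It is a textbook illustration under simplifying assumptions (a binarized page where 1 means ink), not the OCR library's own implementation, and the `erode` function and sample `page` are hypothetical.

```python
def erode(image):
    """Binary erosion with a 3x3 square structuring element.

    `image` is a list of rows of 0/1 ints, where 1 = ink. A pixel stays 1
    only if it and all eight neighbours are 1; pixels outside the image
    are treated as 0.
    """
    height = len(image)
    width = len(image[0]) if height else 0

    def pixel(r, c):
        return image[r][c] if 0 <= r < height and 0 <= c < width else 0

    return [
        [
            int(all(pixel(r + dr, c + dc)
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)))
            for c in range(width)
        ]
        for r in range(height)
    ]


# A 3x3 ink blob plus one isolated speck of noise in the last column.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = erode(page)
```

Here the isolated speck disappears and the blob shrinks to its centre pixel, which is why erosion is commonly used to remove scanner noise and thin overly heavy strokes before character recognition.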