Accelerating Rare Disease Diagnosis with Automated HPO Code Extraction from Clinical Notes
Patient phenotypes – observable traits such as ataxia, muscle weakness, or developmental delay – are vital for understanding genetic disorders and advancing precision medicine. Yet these clues are often scattered across unstructured clinical notes, case reports, and biomedical literature. The Human Phenotype Ontology (HPO), with over 18,000 standardized terms, is now a global standard for genomics and rare disease research, but its value depends on accurately extracting phenotype descriptions from raw text and mapping them to the right terms. In rare disease diagnostics, phenotype profiles are even more critical: a 2020 NIH study found that 97% of diagnostic decisions relied on detailed phenotypic information.
In this webinar, we’ll introduce John Snow Labs’ medical language model pipeline designed to:
- Ingest multiple documents per patient and automatically extract phenotype and gene mentions.
- Normalize terms to HPO codes and provide supporting evidence.
- Link phenotypes with gene mentions to accelerate rare disease and genomic diagnosis.
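To make these steps concrete, here is a minimal sketch of the general idea. It is not John Snow Labs' pipeline, which relies on trained medical language models. The sketch assumes a tiny hand-written lexicon (`HPO_LEXICON`), a hypothetical gene list (`GENE_SYMBOLS`), and a hypothetical helper `extract_patient_profile`. It uses plain string matching, so it does not handle negation, synonyms, or misspellings.

```python
import re
from collections import defaultdict

# Hypothetical mini-lexicon: surface form -> (HPO ID, HPO label).
# Illustrative only; verify IDs against the current HPO release.
HPO_LEXICON = {
    "ataxia": ("HP:0001251", "Ataxia"),
    "muscle weakness": ("HP:0001324", "Muscle weakness"),
    "developmental delay": ("HP:0001263", "Global developmental delay"),
}

# Hypothetical gene symbol list; a real system would use a full gene catalog.
GENE_SYMBOLS = {"FXN", "DMD"}


def extract_patient_profile(documents):
    """Collect HPO-coded phenotypes (with evidence) and gene mentions
    from several documents belonging to one patient.

    documents: dict mapping a document ID to its raw text.
    """
    phenotypes = defaultdict(list)
    genes = set()
    for doc_id, text in documents.items():
        # Naive sentence split on end-of-sentence punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            lowered = sentence.lower()
            for surface, (hpo_id, label) in HPO_LEXICON.items():
                if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
                    # Keep the sentence as supporting evidence for the code.
                    phenotypes[hpo_id].append(
                        {"label": label, "doc": doc_id, "evidence": sentence.strip()}
                    )
            # Treat all-caps tokens found in the gene list as gene mentions.
            for token in re.findall(r"\b[A-Z0-9]{2,}\b", sentence):
                if token in GENE_SYMBOLS:
                    genes.add(token)
    # Linking here is simple patient-level co-occurrence of codes and genes.
    return {"phenotypes": dict(phenotypes), "genes": sorted(genes)}


if __name__ == "__main__":
    notes = {
        "note1": "Patient presents with gait ataxia and muscle weakness. "
                 "Genetic testing of FXN is pending.",
        "note2": "Follow-up: ataxia has progressed.",
    }
    print(extract_patient_profile(notes))
```

Even this toy version shows why evidence tracking matters: each HPO code keeps the document ID and sentence it came from, so a reviewer can check the mapping against the original note.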
We’ll also present real-world benchmarks, comparing the accuracy of John Snow Labs’ approach with ClinPhen and frontier large language models (LLMs). Results show that our pipeline achieves higher accuracy in mapping phenotypes to HPO codes, offering researchers and clinicians a more reliable foundation for genomic analysis and rare disease diagnosis.
Gursev Pirge is a Researcher and Senior Data Scientist with demonstrated success improving the Spark NLP for Healthcare library and delivering hands-on projects in Healthcare and Life Sciences. He has strong statistical skills and presents to all levels of leadership to improve decision making. He has experience in Education, Logistics, Data Analysis, and Data Science, and holds a Ph.D. in Mechanical Engineering from Bogazici University.