Large-scale structured datasets of detailed clinical information about patient journeys are a critical tool in medical research, clinical guideline development, and real-world evidence. They are used heavily to study everything from cancer to COVID-19, but building them is highly challenging because of the massive, specialized effort required to abstract data from noisy and unstructured sources. Automating clinical data abstraction has historically faced three challenges.
First, each project has different guidelines on what, how, and when data should be extracted and normalized.
Second, the data is often drawn from natural language documents or from a combination of structured, imaging, and document sources.
Third, near-perfect accuracy is required to enable medical decision making, so models that achieve 90% accuracy, for example, are not good enough.
In this session, Dr. Dia Trambitas will share an end-to-end, semi-automated system composed of Spark NLP for Healthcare as the underlying NLP engine, a team-based data annotation tool used by human specialists, and an active learning pipeline that automatically applies experts’ feedback to retrain models. This system achieved a 4x speedup in real-world data and clinical data abstraction projects, enabling an order-of-magnitude scale-up while retaining the accuracy achieved by a manual process.