The advent of big data promises to revolutionize medicine by making it more personalized and effective, but it also presents a grand challenge: information overload. For example, tumor sequencing has become routine in cancer treatment, yet interpreting the genomic data requires painstakingly curating knowledge from a vast biomedical literature, which grows by thousands of papers every day.
Electronic medical records contain detailed patient information that can speed up clinical trial recruitment and drug development, but curating such real-world evidence from clinical notes can take hours for a single patient. Natural language processing (NLP) can play a key role in interpreting big data for precision medicine. In particular, machine reading can help unlock knowledge from text by substantially improving curation efficiency.
However, standard supervised methods require labeled examples, which are expensive and time-consuming to produce at scale.
In this talk, I’ll present Project Hanover, where we overcome the annotation bottleneck by combining deep learning with probabilistic logic, by exploiting self-supervision from readily available resources such as ontologies and databases, and by leveraging domain-specific pretraining on unlabeled text. This enables us to extract knowledge from tens of millions of publications, structure real-world data for millions of cancer patients, and apply the extracted knowledge and real-world evidence to support precision oncology.
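To give a flavor of the self-supervision idea, here is a toy sketch of distant supervision: sentences that co-mention an entity pair already recorded in a curated database are labeled as positive training examples, with no manual annotation. This is an illustrative simplification, not the Hanover system; the database entries, sentences, and substring-based entity matching are all hypothetical stand-ins.

```python
# Toy distant supervision: use a database of known relations to
# auto-label sentences for relation extraction. All data here is
# illustrative, not from a real curated resource.

# Known drug-gene relations (hypothetical database entries).
known_relations = {("gefitinib", "EGFR"), ("vemurafenib", "BRAF")}

sentences = [
    "Gefitinib inhibits EGFR signaling in lung cancer cells.",
    "Vemurafenib was ineffective in EGFR-mutant tumors.",
]

def distant_label(sentence, relations):
    """Label a sentence positive for every known pair it co-mentions."""
    text = sentence.lower()
    labels = []
    for drug, gene in relations:
        # Naive substring match stands in for real entity linking.
        if drug.lower() in text and gene.lower() in text:
            labels.append((drug, gene))
    return labels

for sent in sentences:
    print(sent, "->", distant_label(sent, known_relations))
```

Note that such labels are noisy: the second sentence mentions both a drug and a gene but no known pair, so it yields no label, and a co-mention of a known pair need not actually express the relation. Handling that noise in a principled way is where probabilistic logic comes in.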