This talk provides practical insight into building scalable streaming machine learning pipelines that process large datasets in real time using Python and popular frameworks such as Kafka, SpaCy, and Seldon. As a case study, we will perform automated content moderation on Reddit comments in real time. Our dataset consists of 200,000 Reddit comments from /r/science, 50,000 of which were removed by moderators. The end-to-end pipeline will run in a Kubernetes cluster, with Kafka handling the stream processing and individual components leveraging SKLearn, SpaCy, and Seldon.
We will then dive into fundamental stream processing concepts such as windows, watermarking, and checkpointing, and show how each of these frameworks can be used to build complex streaming pipelines that perform real-time processing at scale. Along the way we will build, deploy, and monitor a machine learning model that processes incoming production data.
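To give a flavor of two of those concepts, here is a minimal, framework-free Python sketch of tumbling windows and watermarking over an out-of-order event stream. All names, the window size, and the lateness policy are illustrative assumptions for this abstract, not part of the talk's actual Kafka pipeline:

```python
from collections import defaultdict

WINDOW_SIZE = 10       # tumbling windows cover [0, 10), [10, 20), ...
ALLOWED_LATENESS = 5   # the watermark trails the max event time seen by this much

def window_start(ts):
    """Map an event time to the start of its tumbling window."""
    return (ts // WINDOW_SIZE) * WINDOW_SIZE

def count_per_window(events):
    """Count events per window, dropping any that arrive behind the watermark.

    `events` is an iterable of (event_time, payload) pairs, possibly out of order.
    Returns (counts keyed by window start, number of dropped late events).
    """
    counts = defaultdict(int)
    watermark = float("-inf")
    dropped = 0
    for ts, _payload in events:
        # Advance the watermark as newer events arrive.
        watermark = max(watermark, ts - ALLOWED_LATENESS)
        if ts < watermark:  # event is older than the watermark: treat as too late
            dropped += 1
            continue
        counts[window_start(ts)] += 1
    return dict(counts), dropped

# Out-of-order stream: the events at t=2 and t=4 arrive after the
# watermark has already moved past them.
events = [(1, "a"), (3, "b"), (12, "c"), (2, "d"), (25, "e"), (4, "f")]
counts, dropped = count_per_window(events)
print(counts)   # {0: 2, 10: 1, 20: 1}
print(dropped)  # 2
```

Real stream processors apply the same idea but checkpoint the per-window state so a restarted worker can resume without double-counting.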