Recursion Pharmaceuticals is turning drug discovery into a data science problem, which entails generating petabytes of microscopy images from carefully designed biological experiments. In early 2017 the data generation effort scaled to a point where the existing batch processing system was no longer sufficient, and new use cases required replacing it with a streaming system. After evaluating the typical contenders in this space, e.g. Spark and Storm, we settled on Kafka Streams and Kubernetes instead. By building on top of Kafka and Kubernetes we were able to create a flexible, highly available, and robust pipeline with container support built in. This presentation will walk you through our thought process and explain the tradeoffs between all of these systems in light of our specific use case. We will give a high-level introduction to Kafka Streams and the workflow layer we were able to easily add on top of it to orchestrate our existing microservices. We'll also explain how we leverage Kubernetes Jobs with TaskStore, a custom in-memory task queue system that we wrote. We've been operating these two systems at scale for a year now with success, albeit with some war stories. In the end we find this solution much easier to work with than behemoth frameworks and, thanks to the robustness of these two systems, are able to operate it at much lower cost using preemptible Google Cloud instances.
Scott leads the data engineering team at Recursion and knows a thing or two about programming and distributed systems.
Ben leads engineering at Recursion and knows a thing or two about programming and machine learning.
Special thanks to Recursion Pharmaceuticals for providing food and facilities!