Journey of two streaming frameworks - Spark Streaming and Kafka Streams


Details
18:00 - 18:30 - Mingling
18:30 - 19:30 - Realtime Data Pipelines Using Spark Streaming - Yulia Stolin - Recommendation Group Architect at Outbrain
19:30 - 20:15 - How we built our time-series and tops processing system based on Kafka-Streams, from the naïve approach to the working one - Omer Litov - Engineering Manager, Data Infrastructure @ Imperva
Title:
Realtime Data Pipelines Using Spark Streaming - Yulia Stolin - Recommendation Group Architect at Outbrain
Abstract:
Having near real-time inputs is extremely important when you are running high-scale recommendation systems.
During this session, I will present our journey from batch-based to real-time analytics.
We implemented a data pipeline using Spark Streaming on top of Kafka for accurate, real-time decision making. I will introduce the main components of our architecture, our approach to data management, and the lessons learned along the way.
Finally, we will review several use cases and their architectures, such as:
- Building and running real-time predictive analytics using Contextual Multi-Armed Bandit models for UI A/B test optimisation.
- Running predictive CTR (click-through rate) estimation based on real-time data inputs using weighted linear regression models.
- Building analytics reports that combine real-time data.
By the end of the session, you will be familiar with the Lambda Architecture and core streaming concepts.
You will also learn how to use Spark to combine real-time and batch analytics, and become more familiar with Spark’s capabilities.
Bio:
I have 15 years of hands-on experience in software architecture, specialising in building high-volume, scalable, high-performance distributed data systems, with expertise in Big Data, NoSQL, architecture, and development. I work at Outbrain as a Recommendation Group Architect. I have two sons and like to spend my free time traveling with my family, playing tennis, and swimming.
Title:
How we built our time-series and tops processing system based on Kafka-Streams, from the naïve approach to the working one - Omer Litov - Engineering Manager, Data Infrastructure @ Imperva
Abstract:
We were faced with the challenge of building a backend system for a new type of dashboard. The new dashboards consist of both time-series graphs and tops tables (with a virtually unlimited key range). We already had a working proprietary backend system for time-series processing, but it didn’t support unlimited key ranges and had started to show performance issues. Instead of modifying the existing system, we decided to try a new and promising framework – Kafka Streams.
Most big data frameworks are very complicated, requiring significant effort to set up correctly and maintain. That makes the idea behind Kafka-Streams very appealing: it is a simple all-in-one library, running as a dependency within your microservice. There is no need to set up complicated processing clusters, schedulers, or even a DB – everything is inside. Combined with its very simple API, you can set up a basic Kafka-Streams application within minutes.
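To give a feel for that simplicity, here is a minimal sketch (not the speaker's actual code) of a Kafka-Streams DSL application that counts events per key in one-minute tumbling windows. The topic names ("events", "counts"), application id, and broker address are illustrative assumptions; it needs only the kafka-streams dependency and a running broker.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedCountsDemo {
    public static void main(String[] args) {
        // Ordinary application configuration - no cluster or scheduler to provision.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-counts-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("events")                    // read keyed events
               .groupByKey()                                        // group records by key
               .windowedBy(TimeWindows.of(Duration.ofMinutes(1)))   // 1-minute tumbling windows
               .count()                                             // windowed count, backed by local state
               .toStream()
               .map((windowedKey, count) -> KeyValue.pair(          // flatten the windowed key
                       windowedKey.key() + "@" + windowedKey.window().start(),
                       count.toString()))
               .to("counts");                                       // write per-window counts out

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();                                            // the whole "cluster" is this one process
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Note that the windowed counts are kept in embedded local state stores, which is exactly the "everything is inside" property the talk examines - and, as the abstract hints, where the theory and practice can diverge at scale.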
That is the theory, anyway; the reality turned out to be a bit more complex. Due to various issues I will describe, we avoided the Kafka-Streams high-level API, switched to a lower-level API, and set up a dedicated Cassandra DB. After all these modifications, we ended up with a very robust and performant system.
Bio:
Omer Litov - Engineering Manager, Data Infrastructure @ Imperva
