Journey of two streaming frameworks - Spark Streaming and Kafka Streams


212 people went

Imperva

Derech Menachem Begin 125 · Tel Aviv-Yafo

How to find us

Imperva's Tel Aviv office is at Migdal HaYovel. Switch elevators in the 25th-floor lobby. The building's parking lot costs 22 NIS from 16:30. Imperva's reception phone number: 03-6840101


Details

18:00 - 18:30 - Mingling
18:30 - 19:30 - Realtime Data Pipelines Using Spark Streaming - Yulia Stolin - Recommendation Group Architect at Outbrain
19:30 - 20:15 - How we built our time-series and tops processing system based on Kafka-Streams, from the naïve approach to the working one - Omer Litov Engineering Manager Data Infrastructure @ Imperva

Title:
Realtime Data Pipelines Using Spark Streaming - Yulia Stolin - Recommendation Group Architect at Outbrain

Abstract:
Having near real-time inputs is extremely important when you are running high-scale recommendation systems.
During this session, I will present our journey from batch-based to real-time analytics.
We implemented a data pipeline using Spark Streaming on top of Kafka for real-time, accurate and precise decision making. I will introduce the main components of our architecture, our data management, and the lessons learned from the process.
Finally, we will review different use cases and their architectures, such as:
* Building and running real-time predictive analytics using Contextual Multi-Armed Bandit Models for UI ABTest optimisation.
* Running predictive CTR (click through rate) estimation based on real-time data inputs using weighted linear regression models.
* Building analytics reports that combine real-time data.
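
As a rough illustration of the weighted linear regression mentioned in the second use case, here is a minimal sketch. It is not Outbrain's model: the single input feature, the function name `fit_weighted_line`, and the idea of giving recent observations larger weights are all assumptions made for the example.

```python
def fit_weighted_line(xs, ys, ws):
    """Fit y = intercept + slope * x by weighted least squares (closed form).

    xs, ys: observed feature values and targets (e.g. an observed CTR).
    ws:     non-negative weights; a larger weight means more influence.
    """
    if not xs or not (len(xs) == len(ys) == len(ws)):
        raise ValueError("xs, ys and ws must be non-empty and equally long")
    total_w = sum(ws)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive number")

    # Weighted means of the feature and the target.
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total_w
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total_w

    # Weighted variance of x and weighted covariance of x and y.
    var_x = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    if var_x == 0:
        raise ValueError("need at least two distinct x values")
    cov_xy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))

    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: later observations get larger weights.
intercept, slope = fit_weighted_line(
    xs=[0.0, 1.0, 2.0, 3.0],
    ys=[1.0, 3.0, 5.0, 7.0],
    ws=[1.0, 2.0, 3.0, 4.0],
)
```

In a streaming setting the weights would typically be refreshed as new events arrive, so that the estimate tracks recent behaviour.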

By the end of the session, you will be familiar with the Lambda Architecture and streaming concepts.
You will also learn how to use Spark to combine real-time and batch analytics, and become more familiar with Spark’s capabilities.

Bio:
I have 15 years of hands-on experience in software architecture, specialising in building high-volume, scalable, high-performance, distributed data systems. My expertise is in Big Data, NoSQL, architecture and development. I work at Outbrain as a Recommendation Group Architect. I have two sons and like to spend my free time traveling with the family, playing tennis, and swimming.

Title:
How we built our time-series and tops processing system based on Kafka-Streams, from the naïve approach to the working one - Omer Litov Engineering Manager Data Infrastructure @ Imperva

Abstract:
We were faced with the challenge of building a backend system for a new type of dashboard. The new dashboards consist of both time-series graphs and tops tables (with a virtually unlimited key range). We already had a working proprietary backend system for time-series processing, but it didn’t support unlimited key ranges and had started to show performance issues. Instead of modifying the existing system, we decided to try a new and promising framework: Kafka Streams.
Most big data frameworks are very complicated and require a lot of effort to set up correctly and maintain. That makes the idea behind Kafka Streams very appealing: it is a simple all-in-one library that runs as a dependency within your microservice. There is no need to set up complicated processing clusters, schedulers, or even a DB; everything is inside. Combined with its very simple API, this lets you set up a simple Kafka Streams application within minutes.
That is the theory, anyway; the reality turns out to be a bit more complex. Due to various issues that I will describe, we avoided the Kafka Streams high-level API, switched to a lower-level API, and set up a dedicated Cassandra DB. After all the modifications we made, we ended up with a very robust and performant system.
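
To make the two dashboard elements described above concrete, here is a minimal in-memory sketch of time-bucketed aggregation that can answer both a time-series query and a "tops" query. It is an illustration only and not Imperva's design: the class name `TopsAndSeries`, the 60-second bucket size, and keeping all state in process memory are assumptions. A real system with a virtually unlimited key range would need an external store.

```python
from collections import Counter, defaultdict


class TopsAndSeries:
    """Aggregate (timestamp, key, value) events into fixed-size time buckets."""

    def __init__(self, bucket_seconds=60):
        self.bucket_seconds = bucket_seconds
        # bucket start time -> Counter mapping key -> summed value
        self.buckets = defaultdict(Counter)

    def _bucket_start(self, timestamp):
        # Round the timestamp down to the start of its bucket.
        return int(timestamp // self.bucket_seconds) * self.bucket_seconds

    def add(self, timestamp, key, value=1):
        self.buckets[self._bucket_start(timestamp)][key] += value

    def time_series(self, key):
        """Return [(bucket_start, total)] for one key, ordered by time."""
        return [
            (start, counts[key])
            for start, counts in sorted(self.buckets.items())
            if key in counts
        ]

    def top(self, timestamp, n):
        """Return the n largest (key, total) pairs in the bucket holding timestamp."""
        counts = self.buckets.get(self._bucket_start(timestamp))
        return counts.most_common(n) if counts else []


# Hypothetical events: (seconds since start, key, value).
agg = TopsAndSeries(bucket_seconds=60)
for ts, key, value in [(5, "a", 3), (10, "a", 1), (59, "b", 1), (61, "a", 2)]:
    agg.add(ts, key, value)
```

Calling `agg.time_series("a")` then yields one point per bucket for that key, while `agg.top(30, 1)` yields the largest key in the first bucket.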

Bio:
Omer Litov, Engineering Manager, Data Infrastructure @ Imperva