Benchmarking Stream Processing Frameworks and Parquet optimizations

Hosted By
Nico P. and Bas H.

Details

Hi all, I'm happy to announce another Spark Meetup. This time Giselle van Dongen, Lead Data Scientist at Klarrio, will talk about her research topic, Benchmarking Stream Processing Frameworks, covering performance results and best practices for Spark Structured Streaming, Storm, Flink and Kafka Streams. A second talk will cover the widely used Parquet columnar format and optimization opportunities to speed up your Spark jobs.

The meetup is hosted and sponsored by Databricks.

Agenda:

18:00 Arrive, mingle, food, drinks etc.

18:30 The Parquet format and performance optimization opportunities
by Boudewijn Braams (Databricks)
Apache Parquet is a popular open-source columnar storage format. It has built-in support for coarse-grained predicate pushdown because it explicitly stores column value statistics at different levels of granularity. Data in Parquet can be encoded and compressed using a variety of schemes. Boudewijn will present this widely used format and guide us through some optimization opportunities, both as a user and as a developer, to speed up our Spark workloads.
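
To make the idea of statistics-based predicate pushdown concrete, here is a minimal sketch in plain Python. It is not Parquet's or Spark's actual implementation and does not read real Parquet files: `RowGroup`, `make_row_group` and `scan_greater_than` are hypothetical names invented for illustration. The sketch assumes each chunk of data carries a stored minimum and maximum, and shows how a reader could skip whole chunks whose statistics rule out a filter such as `value > threshold`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a chunk of one integer column plus its statistics."""
    values: List[int]
    min_value: int
    max_value: int


def make_row_group(values: List[int]) -> RowGroup:
    # Statistics are computed once, when the data is written.
    return RowGroup(values=values, min_value=min(values), max_value=max(values))


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the matching values and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # Coarse-grained skip: if even the maximum fails the predicate,
        # no value in this group can match, so its data is never touched.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


row_groups = [
    make_row_group([1, 5, 9]),
    make_row_group([12, 15, 18]),
    make_row_group([20, 25, 30]),
]
matches, groups_read = scan_greater_than(row_groups, 16)
print(matches, groups_read)  # [18, 20, 25, 30] 2
```

Only two of the three groups are read here. The pruning is coarse-grained: a group that passes the min/max check still has to be filtered value by value, which is why data that is sorted or clustered on the filter column tends to benefit most.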

19:00 Benchmarking Stream Processing Frameworks: a Testimony
by Giselle van Dongen (Klarrio)
Due to the increasing interest in real-time processing, many stream processing frameworks were developed. However, no clear guidelines have been established for choosing a framework and designing efficient processing pipelines. Our work is a first step towards filling this gap by establishing a benchmark methodology for fine-grained benchmarking of common operations on multiple metrics: latency, peak throughput, sustainable throughput, memory usage, and CPU utilization. We implemented this benchmark for four popular stream processing frameworks: Spark (both Streaming and Structured Streaming), Storm, Flink and Kafka Streams.
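
As a rough illustration of two of the metrics named above, the sketch below derives latency percentiles and average throughput from per-event timestamps. It is not the benchmark described in the talk: the function names, the nearest-rank percentile method and the sample timestamps are all assumptions made for this example, and sustainable throughput, memory usage and CPU utilization are not covered.

```python
import math
from typing import Dict, List, Tuple


def percentile(sorted_values: List[int], fraction: float) -> int:
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(fraction * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(events: List[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize one run.

    events holds hypothetical (ingest_time_ms, output_time_ms) pairs, one per event.
    """
    latencies = sorted(out - ingest for ingest, out in events)
    # Wall-clock span from the first ingested event to the last emitted result.
    duration_ms = max(out for _, out in events) - min(ingest for ingest, _ in events)
    return {
        "median_latency_ms": percentile(latencies, 0.50),
        "p99_latency_ms": percentile(latencies, 0.99),
        "throughput_events_per_s": len(events) * 1000 / duration_ms,
    }


# Made-up timestamps, for illustration only.
events = [(0, 40), (100, 130), (200, 250), (300, 320), (400, 500)]
summary = summarize(events)
print(summary)
```

With these five made-up events the median latency is 40 ms, the 99th percentile is 100 ms, and the average throughput is 10 events per second. Reporting a tail percentile next to the median matters because a single slow event barely moves the median but dominates the tail.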

Giselle van Dongen is Lead Data Scientist at Klarrio, specializing in real-time data analysis, processing and visualization. She is also a PhD researcher at Ghent University, where she teaches and benchmarks real-time distributed processing systems such as Spark Streaming, Structured Streaming, Flink and Kafka Streams. In this talk, she will share some of the hurdles and realizations she encountered when benchmarking stream processing frameworks like Spark Streaming.

19:45 Q&A, mingle, food, drinks.

21:30 End of the meetup, everybody out

Hope to see you there, Niels

Data Council Amsterdam - NL Data Engineering & Science