Apache Spark is an open-source cluster computing framework for data analytics, originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.
Hi all, I'm happy to announce another Spark Meetup. This time Giselle van Dongen, Lead Data Scientist at Klarrio, will talk about her research topic, Benchmarking Stream Processing Frameworks, covering performance results and best practices for Spark Structured Streaming, Storm, Flink and Kafka Streams. A second talk will cover the widely used Parquet columnar format and optimization opportunities to speed up your Spark jobs.
The meetup will take place at, and is sponsored by, Databricks.
Agenda:
18:00 Arrive, mingle, food, drinks etc.
18:30 The Parquet format and performance optimization opportunities
by Boudewijn Braams (Databricks)
Apache Parquet is a popular open-source columnar storage format. It has built-in support for coarse-grained predicate pushdown, because it explicitly stores column value statistics at different levels of granularity. Data in Parquet can be encoded and compressed using a variety of schemes. Boudewijn will present this widely used format and guide us through some optimization opportunities, both as a user and as a developer, to speed up our Spark workloads.
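For those who want a feel for the idea before the talk, here is a minimal sketch of coarse-grained predicate pushdown using min/max statistics. It is plain Python, not Parquet's or Spark's actual API: the RowGroup class and the scan_greater_than function are hypothetical names made up for illustration, and a real Parquet file stores these statistics in its metadata at write time rather than computing them on the fly.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group holding one integer column."""
    values: List[int]

    @property
    def min_value(self) -> int:
        # Real Parquet writes this statistic into the file metadata once.
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def scan_greater_than(row_groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values > threshold and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in row_groups:
        # Coarse-grained skip: if the largest value does not exceed the
        # threshold, no row in this group can match, so it is never read.
        if group.max_value <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.values if v > threshold)
    return matches, groups_read


row_groups = [RowGroup([1, 5, 9]), RowGroup([12, 18, 15]), RowGroup([30, 22, 41])]
matches, groups_read = scan_greater_than(row_groups, 20)
print(matches, groups_read)
```

With the filter "value > 20", only the last of the three row groups has to be read. The skip is coarse-grained in the sense that it rules out whole groups; rows inside a group that survives still have to be checked one by one.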
19:00 Benchmarking Stream Processing Frameworks: a Testimony
by Giselle van Dongen (Klarrio)
Due to the increasing interest in real-time processing, many stream processing frameworks have been developed. However, no clear guidelines have been established for choosing a framework and designing efficient processing pipelines. Our work is a first step towards filling this gap by establishing a benchmark methodology for fine-grained benchmarking of common operations on multiple metrics: latency, peak throughput, sustainable throughput, memory usage, and CPU utilization. We implemented this benchmark for four popular stream processing frameworks: Spark (both Streaming and Structured Streaming), Storm, Flink and Kafka Streams.
Giselle van Dongen is Lead Data Scientist at Klarrio specializing in real-time data analysis, processing and visualization. Concurrently she is a PhD researcher at Ghent University, teaching and benchmarking real-time distributed processing systems such as Spark Streaming, Structured Streaming, Flink and Kafka Streams. In this talk, she will give insight into some of the hurdles and realizations when benchmarking Stream Processing Frameworks like Spark Streaming.
19:45 Q&A, mingle, food, drinks.
21:30 End of the meetup/everybody out
Hope to see you there, Niels