Parquet Optimisations and Building Spark Data Pipelines

What we'll do

Join us for the next Apache Spark London Meetup! In anticipation of Spark+AI Summit Europe, we have a couple of talks from the event to give you a taster. As usual there will be food and refreshments, an opportunity to network, and some great talks. So join us for an evening of Apache Spark!

Title: The Parquet Format and Performance Optimization Opportunities

Speaker: Boudewijn Braams (Databricks)

Abstract
Apache Parquet is an open-source columnar storage format and one of the most widely used storage formats in the Spark ecosystem. Understanding the intricacies of your storage format is important for optimizing your workloads: I/O is expensive, and the storage layer is the entry point for any query execution. In this talk we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives. We will dive deeper into the specifics of the Parquet format: representation on disk, physical data organization and encoding schemes. Equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and leveraging partitioning schemes. We will learn about the evil that is ‘many small files’, and will discuss the recently released open-source Delta Lake format in relation to this and to Parquet in general. This talk serves both as an approachable refresher on columnar storage and as a guide on how to leverage the Parquet format to speed up analytical workloads in Spark, with tangible tips and tricks.
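As a taster of the kinds of optimizations the talk covers, here is a minimal Spark sketch of partitioning and predicate pushdown against Parquet. The paths and column names are hypothetical and purely illustrative, not taken from the talk:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ParquetOptimizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-optimizations")
      // Parquet filter pushdown is on by default in recent Spark versions;
      // shown here only to make the setting explicit
      .config("spark.sql.parquet.filterPushdown", "true")
      .getOrCreate()

    // Hypothetical event data; path and columns are illustrative only
    val events = spark.read.parquet("/data/events_raw")

    // Writing with a partitioning scheme lets Spark prune whole directories
    // at read time instead of scanning every file
    events.write
      .partitionBy("event_date")
      .parquet("/data/events_partitioned")

    // Predicate pushdown: min/max statistics stored per Parquet row group
    // let the reader skip data that cannot possibly match the filter
    val recent = spark.read.parquet("/data/events_partitioned")
      .filter(col("event_date") === "2019-10-01")
      .filter(col("status") === "ACTIVE")

    // PushedFilters in the physical plan confirm the pushdown happened
    recent.explain()

    spark.stop()
  }
}
```

Reading only the partitions and row groups that can match the filters is where the bulk of the I/O savings come from.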

Bio
Boudewijn Braams is a Software Engineer at Databricks based in Amsterdam. He is part of the Storage & I/O team, one of the teams focussing on Databricks Runtime performance and stability. In this team, he has worked on improving Parquet robustness, the Delta caching layer and cloud storage connectors. Prior to starting full-time, he did his MSc thesis at Databricks, exploring early filtering techniques like predicate pushdown in the context of Parquet and the Databricks Runtime. He holds a joint Master’s degree in Computer Science from the University of Amsterdam and the Vrije Universiteit.

Title: Best Practices for Building and Deploying Data Pipelines in Apache Spark

Speaker: Vicky Avison (Cox Automotive UK)

Abstract
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, I will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. I’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations. I’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll discuss new approaches and best practices for what I believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
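One of the production considerations the abstract mentions is the small file problem. Below is a minimal sketch of the usual remedy, compacting many small Parquet files into a few larger ones; this is plain Spark rather than Waimak’s API, and the paths and output file count are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object CompactSmallFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-small-files")
      .getOrCreate()

    // Hypothetical table produced by many small incremental writes
    val df = spark.read.parquet("/data/landing/events")

    // Repartition to a small, fixed number of output files so downstream
    // jobs open a handful of large files instead of thousands of tiny ones
    df.repartition(8)
      .write
      .mode("overwrite")
      .parquet("/data/curated/events")

    spark.stop()
  }
}
```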

Bio
Vicky is the Lead Data Engineer at Cox Automotive Data Solutions. She has over 5 years' experience writing high-performance applications in MapReduce and Spark. She graduated from the University of Warwick with a Master of Mathematics degree in 2013 and, after a brief stint in Android development, has been solving data problems ever since. She now spends most of her days building and optimizing data pipelines, and is co-creator of Waimak, an open-source framework that makes it easier to create complex data flows in Apache Spark.