Apache Arrow - In Theory & In Practice; Vectorized Processing


Details
• What we'll do
Join us for talks from the PMC Chair of the Apache Arrow project, Jacques Nadeau, Arrow committer Siddharth Teotia, and Bryan Cutler. Details below.
6:00p: Doors open – hang out, have food and drinks, and chat!
7:00p: Talks begin
8:00p - 9:00p: hang out, drinks, chat
We'll have food and drinks for everyone. Our kind host for this meetup is Thumbtack. Please note:
You must RSVP so we can provide your name to the security desk at the building entrance.
Check-in will close at 7:15p; we can’t guarantee that you’ll be able to get in after that time.
Speaker: Jacques Nadeau
Title: Apache Arrow: In Theory & Practice
Apache Arrow is designed to make things faster. It’s focused on speeding communication between systems as well as processing within any one system.
In this talk, Jacques will start by discussing what Arrow is and why it was built, including an overview of its key components, goals, vision, and current state. Jacques will then take the audience through a detailed engineering review of how we used Arrow to solve several problems when building the Apache-licensed Dremio product. This will cover Arrow performance characteristics, working with Arrow APIs, managing memory, sizing Arrow vectors, and moving data between processes and/or nodes. We’ll also review several code examples of specific data processing implementations and how they interact with Arrow data.
Lastly, we’ll spend a short amount of time on what’s next for Arrow. This will be a highly technical talk aimed at people building data infrastructure systems and complex workflows.
Speaker: Bryan Cutler
Title: Apache Arrow usage in Apache Spark, Pandas and Vectorized UDFs
Beginning with Apache Spark 2.3.0, Apache Arrow has been employed in several places in Spark to increase performance and usability for Python users. In this talk, Bryan will discuss the problems Spark faced and how Arrow solved them. He will cover the ways Arrow can currently be used in Spark and go over some specific examples. Finally, he will talk about his ongoing work in this area and possible future improvements.
Speaker: Siddharth Teotia
Title: Vectorized query processing using Apache Arrow
Query processing technology has evolved rapidly since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that use the CPU efficiently and take advantage of changing hardware trends to deliver optimal performance. Siddharth will outline the different types of vectorized query processing in Dremio using Apache Arrow.
Columnar data has become the de facto format for building high-performance query engines that run analytical workloads. Apache Arrow is an in-memory columnar data format that houses canonical in-memory representations for both flat and nested data structures. It is a natural complement to on-disk formats like Apache Parquet and Apache ORC. Dremio’s query processing engine leverages the columnar format of Apache Arrow and Parquet for in-memory and on-disk representations respectively.
