Next Meetup

Deep Dive into Spark SQL
Hi all,

We're happy to announce another Spark meetup. This time we'll do a deep dive into Spark SQL. Databricks is our sponsor for this event, providing the food and the location.

Agenda:

18:00: Arrive, mingle, food (pizza), drinks, etc.

18:45: Correctness and Performance of Apache Spark SQL, by Nico Poggi and Bogdan Ghitt

In this talk, we present a comprehensive framework we developed at Databricks for assessing the correctness, stability, and performance of our Spark SQL engine. Apache Spark is one of the most actively developed open source projects, with more than 1,200 contributors from all over the world. At this scale and pace of development, mistakes are bound to happen. We will discuss the various approaches we take, including random query generation, random data generation, random fault injection, and longevity stress tests. We will demonstrate the effectiveness of the framework by highlighting several correctness issues we found through random query generation, and critical performance regressions we were able to diagnose within hours thanks to our automated benchmarking tools.

19:45: An Introduction to Higher Order Functions in Spark SQL, by Herman van Hovell

Nested data types offer Apache Spark users powerful ways to manipulate structured data. In particular, they allow you to put complex objects like arrays, maps and structures inside of columns. This can help you model your data in a more natural way. While this feature is certainly useful, it can be quite cumbersome to manipulate data inside of complex objects, because SQL (and Spark) does not have primitives for working with such data. In addition, manipulating such data is time-consuming, non-performant, and non-trivial. During this talk we will discuss some of the commonly used techniques for working with complex objects, and we will introduce new ones based on higher-order functions. Higher-order functions will be part of Spark 2.4 and are a simple and performant extension to SQL that allows a user to manipulate complex data such as arrays.

21:00: End of the meetup, everybody out

Hope to see you there,
Niels
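
For anyone wondering what the higher-order functions in the second talk look like, here is a rough sketch. It is not taken from the talk: it models three of the Spark 2.4 SQL functions (transform, filter, aggregate) with plain Python lists, and the sample row and the helper name filter_array are made up for illustration. The matching Spark SQL expressions are shown in the comments. Spark's null handling is not modeled here.

```python
from functools import reduce

# Plain-Python stand-ins for three Spark SQL higher-order functions.
# They model the semantics only; in Spark the lambda runs inside the engine.

def transform(values, fn):
    """Like SQL transform(values, x -> ...): apply fn to every element."""
    return [fn(x) for x in values]

def filter_array(values, predicate):
    """Like SQL filter(values, x -> ...): keep elements matching predicate."""
    return [x for x in values if predicate(x)]

def aggregate(values, zero, merge):
    """Like SQL aggregate(values, zero, (acc, x) -> ...): fold into one value."""
    return reduce(merge, values, zero)

# One row whose "values" column holds an array (hypothetical data).
row = {"key": 1, "values": [1, 2, 3]}

# SELECT transform(values, x -> x + 1)
incremented = transform(row["values"], lambda x: x + 1)

# SELECT filter(values, x -> x % 2 = 0)
evens = filter_array(row["values"], lambda x: x % 2 == 0)

# SELECT aggregate(values, 0, (acc, x) -> acc + x)
total = aggregate(row["values"], 0, lambda acc, x: acc + x)

print(incremented, evens, total)
```

The point of the SQL versions is that the array is processed in place, inside the column, without first exploding it into separate rows and grouping it back together.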

Databricks

Barbara Strozzilaan 350 · Amsterdam

    What we're about

    Apache Spark is an open-source cluster computing framework for data analytics, originally developed in the AMPLab at UC Berkeley. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). However, Spark is not tied to the two-stage MapReduce paradigm, and promises performance up to 100 times faster than Hadoop MapReduce for certain applications. Spark provides primitives for in-memory cluster computing that allow user programs to load data into a cluster's memory and query it repeatedly, making it well suited to machine learning algorithms.
