The Road to Apache Spark 3.0, Koalas and Neptune Spark Meetup

This is a past event

100 people went

Join us for the next Apache Spark London Meetup! After all the excitement of Spark Summit US, we thought it would be great to have a follow-up meetup. As usual, there will be food and refreshments, an opportunity to network, and some great talks. So join us for an evening of Apache Spark!

Title: The Road to Upcoming Apache Spark 3.0 and Koalas: Unifying Spark and pandas APIs

Speaker: Tim Hunter (Databricks)

This talk comes in two halves. The first part will present some of the recent developments that may pave the road to Apache Spark 3.0. Spark has grown in scope and features throughout the 2.x series, and the community has been working hard to crystallize what the future will look like. While many of these features are still being finalized, we will go over the recent announcements at the Spark + AI Summit.

In the second part, we will present Koalas, a new open-source project that was announced at the Spark + AI Summit. Koalas is a Python package that implements the pandas API on top of Apache Spark; pandas is the standard Python data science package of choice for small data sets. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
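To give a feel for what "implements the pandas API" means in practice, here is a minimal sketch, not taken from the talk. The function name mean_value_per_key, its frames parameter, and the column names are made up for illustration. It assumes that the Koalas module (imported as databricks.koalas) accepts the same DataFrame, groupby, mean, and to_dict calls as pandas for this simple case.

```python
def mean_value_per_key(frames, rows):
    """Average the "value" column for each distinct "key".

    `frames` is a hypothetical parameter: pass the pandas module for
    single-machine work, or the databricks.koalas module to run the
    same calls on Apache Spark (assumed to be API-compatible here).
    """
    df = frames.DataFrame(rows)                # same constructor call in both libraries
    means = df.groupby("key")["value"].mean()  # same groupby/aggregate syntax
    return means.to_dict()                     # collect the small result locally
```

With pandas installed, calling mean_value_per_key(pandas, {"key": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]}) averages the two "a" rows together. On a cluster with Spark and Koalas available, the only intended change would be passing the databricks.koalas module instead.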

Tim Hunter is a software engineer at Databricks and a co-creator of the Koalas project. He contributes to the Apache Spark MLlib project, as well as the GraphFrames, TensorFrames and Deep Learning Pipelines libraries. He has been building distributed Machine Learning systems with Spark since version 0.0.2, before Spark was an Apache Software Foundation project.

Title: Cooperative Task Execution For Apache Spark

Speaker: Panagiotis Garefalakis (Imperial College London)

Apache Spark has enabled users to express batch, streaming, and machine learning computations as part of the same unified runtime, so that they can share application logic and state or interact with each other. Examples include online machine learning, real-time data transformation and serving, and low-latency event monitoring and reporting. Although Structured Streaming provides the programming interface to enable such unified computation over bounded and unbounded data, the underlying execution engine was not designed to efficiently support jobs with different requirements (i.e., latency vs. throughput) as part of the same runtime. It therefore becomes particularly challenging to schedule such jobs so that they utilize the cluster resources efficiently while respecting their requirements in terms of task response times.

In this talk, we will present Neptune, a new cooperative task execution framework for Spark with fine-grained control over resources such as CPU time. Neptune utilizes Scala coroutines as a lightweight mechanism to suspend task execution with sub-millisecond latency and introduces new scheduling policies that respect diverse task requirements while efficiently sharing the same runtime. Users can directly use Neptune for their unified applications, as it supports all existing DataFrame, Dataset, and RDD APIs. We present an implementation of the execution model as part of Spark 2.4.0 and describe the observed performance benefits from running a number of streaming and machine learning workloads on an Azure cluster.
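The idea of suspending a running task so that a latency-sensitive one can run first can be illustrated with a toy scheduler. This is only a sketch of the general cooperative-execution idea, using Python generators as a stand-in for coroutines. It is not Neptune's implementation or API, and every name in it (task, step_once, run_cooperatively, the tick-based arrivals) is hypothetical.

```python
from collections import deque


def task(name, steps, log):
    """A toy task: does `steps` units of work, pausing after each one."""
    for step in range(steps):
        log.append(f"{name}:{step}")  # one unit of work
        yield  # suspension point: control returns to the scheduler


def step_once(queue):
    """Advance the task at the head of `queue` by one unit of work.

    Finished tasks are discarded. Returns False if nothing could run.
    """
    while queue:
        try:
            next(queue[0])
            return True
        except StopIteration:
            queue.popleft()
    return False


def run_cooperatively(batch_tasks, arrivals):
    """Run batch tasks, suspending them while urgent tasks are pending.

    `arrivals` maps a tick number to the urgent tasks arriving at that tick
    (an assumption of this sketch, standing in for a streaming workload).
    """
    low = deque(batch_tasks)  # throughput-oriented work
    high = deque()            # latency-sensitive work
    tick = 0
    while low or high or any(t >= tick for t in arrivals):
        high.extend(arrivals.get(tick, []))
        # Urgent work always goes first; batch work resumes only when
        # no urgent task is runnable in this tick.
        if not step_once(high):
            step_once(low)
        tick += 1
```

For example, a four-step batch task with a two-step urgent task arriving at tick 2 is paused after its second step and resumes where it left off once the urgent task finishes. The batch task never restarts from scratch, which is the property that makes suspension cheaper than killing and rerunning a task.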

Panagiotis Garefalakis is a Ph.D. candidate at Imperial College London, Department of Computing. He is affiliated with the Large-Scale Data & Systems (LSDS) group and his research interests lie within the broad area of systems including large-scale distributed systems, cluster resource management, and stream processing.