Towards Writing Scalable Spark Apps & PySpark and the Apache Arrow Integration

Apache Spark+AI London


Please join us for the next Spark London Meetup! We have two talks: one on building scalable Spark apps, and one on PySpark and its Apache Arrow integration. As usual, there will be beer and pizza available courtesy of our sponsors Capgemini and Databricks - so please do come along!

Networking and drinks will be from 6:30pm with the talks starting around 7pm.

Title: Towards Writing Scalable Spark Applications

Speaker: Philipp Brunenberg

Abstract: When beginning to use Spark, we have a choice between two roads. We can sit down and leverage the convenience of the high-level APIs to implement the use cases we came for directly. Usually, we achieve this with trial, error and StackOverflow. By doing so, we rely on Spark to magically execute our workload in what we hope is the most efficient way. Most developers would stop here. Alternatively, we can start our journey by first gaining an understanding of Spark's concepts and what is happening internally. In my experience, most people, and also most companies, tend to take the former approach and start implementing their use cases right away. This approach is certainly valid, but it often leaves us out in the rain when performance issues arise as we try to scale our projects.

In this talk, we will walk through the most important internal components of Spark (Core) to gain a deeper understanding of how parallelization is achieved. Based on these insights, we will illuminate some of the most common performance pitfalls and analyze where they originate. Whether you are an experienced Spark user or want to leverage all of its beauty right from the start, this talk gives practical advice on how to write better Spark code.

Bio: Philipp is a freelance data science and big data consultant who helps his clients bring data-driven use cases to life. He is passionate about helping companies create innovative applications, improve existing ones, and educate their teams on how to write scalable code. When he works with in-house development teams, he aims to leave behind a different way of thinking about their challenges and how to solve them. He has spoken at various events to help people gain a better understanding of how Spark is designed and of its most fundamental concepts.

Title: How does that PySpark thing work? And why Arrow makes it faster?

Speaker: Ruben Berenguel

Abstract: Back in ye olde days of Spark, using Python with Spark was an exercise in patience. Data moved back and forth between Python and Scala, being serialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, as did the constant improvement of the optimisers (Catalyst and Tungsten). But with Spark 2.3, PySpark has sped up tremendously thanks to the (still experimental) addition of the Arrow serialisers.

In this talk, we will learn how PySpark has improved its performance in Apache Spark 2.3 by using Apache Arrow. To do this, we will travel through the internals of Spark to see how Python interacts with the Scala core, and through some of the internals of Pandas to see how data moves from Python to Scala via Arrow.

Bio: Ruben Berenguel is a big data engineering consultant and occasional contributor to Spark (especially PySpark). He holds a PhD in Mathematics and moved into data engineering, where he works mostly with Scala, Python and Go, designing and implementing big data pipelines in London and Barcelona.