
Speeding up PySpark with Arrow, by Ruben Berenguel

Hosted By
Ferran Galí i R. and 4 others


Hello Sparklers!

We're starting the new year eager for knowledge, and we'll begin warming up our brains with a bit of PySpark and Apache Arrow by Ruben Berenguel.
You might already know him if you've signed up to our Slack, where he's a pretty active user! If you haven't, do it here:

See you on Thursday, 31st of January, at 19:00!

Thanks to Trovit Search for letting us hold the event in their offices. We're really thankful for the venue and the pizzas 🍕 + 🍺!!

How does that PySpark thing work? And why does Arrow make it faster?

Back in ye olde days of Spark, using Python with Spark was an exercise in patience. Data kept moving back and forth between Python and Scala, being serialised constantly. Leveraging SparkSQL and avoiding UDFs made things better, as did the constant improvement of the optimisers (Catalyst and Tungsten). But since Spark 2.3, PySpark has sped up tremendously thanks to the addition of the Arrow serialisers.

Ruben Berenguel is a big data engineering consultant and occasional contributor to Spark (especially PySpark). He holds a PhD in Mathematics and moved into data engineering, where he works mostly with Scala, Python and Go, designing and implementing big data pipelines in London and Barcelona.
LifullConnect (Trovit)
Avinguda Diagonal, 601 08028 Barcelona · Barcelona