Details

One of the big challenges in big data is interacting with the storage layer, especially in a data lake, where we are the ones who manage the files and partitions.
One of the most common performance problems in data lakes is working with small files.
In this lecture we will learn about:
* Why it's important to read and write files at best-practice sizes
* How Apache Spark interacts with files under the hood, and how this relates to Spark tasks
* How to easily detect and fix the small files problem (using the open-source library DataFlint)
* How to handle the small files problem when using storage formats such as Delta Lake and Iceberg
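At its core, the "detect" step above amounts to scanning a table's directory and flagging files that are far below Spark's default 128 MB read-split size. A minimal local-filesystem sketch (the `find_small_files` helper and the threshold are illustrative assumptions, not DataFlint's API):

```python
import os

# Assumed threshold: files well below Spark's default 128 MB
# read-split size (spark.sql.files.maxPartitionBytes) are "small".
SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # 128 MB

def find_small_files(root, threshold=SMALL_FILE_THRESHOLD):
    """Walk a local directory tree and return (path, size) pairs
    for files below the threshold."""
    small = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size < threshold:
                small.append((path, size))
    return small
```

A real data lake would require listing object-store paths (e.g. S3) rather than a local directory, but the principle is the same: many files far below the split size means many tiny Spark tasks and excessive listing/open overhead.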

Lecturer: Meni Shmueli, founder and author of DataFlint (https://github.com/dataflint/spark).
Ex-Unit 81, ex-ZipRecruiter, and ex-Granulate.
Passionate about everything related to big data and about working with data teams to solve their day-to-day challenges.
Over the years he has helped dozens of companies improve performance, debug issues, and increase dev velocity in the big data world, and he is currently working on performance observability for big data with DataFlint.

Related topics

Amazon Web Services
Apache Spark
Big Data