Details

One of the big challenges in big data is interacting with the storage layer, especially in a data lake, where we are the ones who manage the files and partitions.
One of the most common performance problems in data lakes is working with small files.
In this lecture we will learn about:
* Why it's important to read and write files at best-practice sizes
* How Apache Spark interacts with files under the hood, and how this relates to Spark tasks
* How to easily detect and fix the small files problem (using the open-source library DataFlint)
* How to handle the small files problem when using storage formats such as Delta Lake and Iceberg
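At its core, the "detect" step above amounts to scanning a table's directory and flagging files that are far below Spark's default 128 MB read-split size. A minimal local-filesystem sketch (the `find_small_files` helper and the threshold are illustrative assumptions, not DataFlint's API):

```python
import os

# Assumed threshold: files well below Spark's default 128 MB
# read-split size (spark.sql.files.maxPartitionBytes) are "small".
SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # 128 MB

def find_small_files(root, threshold=SMALL_FILE_THRESHOLD):
    """Walk a local directory tree and return (path, size) pairs
    for files below the threshold."""
    small = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size < threshold:
                small.append((path, size))
    return small
```

A real data lake would require listing object-store paths (e.g. S3) rather than a local directory, but the principle is the same: many files far below the split size means many tiny Spark tasks and excessive listing/open overhead.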

Lecturer: Meni Shmueli, founder and author of DataFlint (https://github.com/dataflint/spark).
Ex-Unit 81, ex-ZipRecruiter, and ex-Granulate.
Passionate about everything related to big data and about working with data teams to solve their day-to-day challenges.
Over the years he has helped dozens of companies improve performance, debug issues, and increase dev velocity in the big data world, and he is currently working on performance observability for big data with DataFlint.

Related topics

Amazon Web Services
Apache Spark
Big Data