Optimizing Data Lakes: A Pipeline to Aggregate Small Files


Details
Small files are the bane of a data lake, increasing your query times and processing costs. However, you often don't get to control the data that you receive. For example, CloudTrail writes one file for each account and region, approximately every 15 minutes; that's dozens or even hundreds of files a day, some of which contain only a few events.
An Athena query against the raw CloudTrail data might take minutes to execute; most of that time is overhead from reading each individual file. By comparison, after aggregating the CloudTrail logs into one file per day, the same query takes only a few seconds.
In this talk, Keith Gregory walks through a data pipeline that uses Lambda to aggregate these files into a form that can be queried efficiently. He looks at the general design of such a pipeline, how to trigger it, how to monitor it, and how to be resilient to processing errors.
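To give a flavor of the aggregation step such a pipeline performs, here is a minimal sketch in Python. It is not the speaker's actual implementation: the function name `aggregate_cloudtrail` is hypothetical, and a real Lambda would also read and write the files in S3 (e.g. with boto3) and handle gzip compression, which is omitted here.

```python
import json
from collections import defaultdict

def aggregate_cloudtrail(payloads):
    """Merge the Records arrays from many small CloudTrail JSON files
    into one newline-delimited JSON blob per day (hypothetical helper;
    a real pipeline would stream these from and back to S3)."""
    by_day = defaultdict(list)
    for payload in payloads:
        for record in json.loads(payload).get("Records", []):
            # "2024-01-01T00:05:00Z" -> "2024-01-01"
            day = record["eventTime"][:10]
            by_day[day].append(record)
    return {
        day: "\n".join(json.dumps(r) for r in records)
        for day, records in by_day.items()
    }
```

Writing one newline-delimited file per day in this fashion is what lets Athena scan a single large object instead of paying per-file open overhead hundreds of times.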
Host/Sponsor: Chariot Solutions
Chariot Solutions is a software development consultancy located in the Greater Philadelphia area. We provide small teams of experienced, multi-talented engineers who work closely with your development team. The result has been over two decades of successful projects in application development and data engineering. If you have a challenge, let's work together to find the right solution.
Agenda
5:45 Meet and Greet, pizza provided.
6:15 Presentation.
If you connect via Zoom, be aware that the audio might not be great, as I'll be using my laptop's built-in microphone. Also note that we'll be recording the presentation.
