Optimizing Data Lakes: A Pipeline to Aggregate Small Files

Hosted By
Grant and Keith G.

Details

Small files are the bane of a data lake, increasing your query times and processing costs. However, often you don’t get to control the data that you receive. For example, CloudTrail writes one file for each account and region approximately every 15 minutes: dozens or even hundreds of files a day, some of which contain only a few events.

An Athena query against the raw CloudTrail data might take minutes to execute, with most of that time spent on the overhead of reading each file. By comparison, after aggregating the CloudTrail logs into one file per day, the same query takes only a few seconds.

In this talk, Keith Gregory walks through a data pipeline that uses Lambda to aggregate these files into a form that can be queried efficiently. He looks at the general design of such a pipeline, how to trigger it, how to monitor it, and how to be resilient to processing errors.
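
As a rough illustration of the aggregation step only (not the implementation presented in the talk), the sketch below groups CloudTrail object keys by the date embedded in their path, then merges the gzipped log files for one day into a single gzipped file. The function names and the newline-delimited JSON output format are assumptions made for this example. In a Lambda, the input bytes would be read from S3 and the result written back to S3.

```python
import gzip
import io
import json
import re
from collections import defaultdict

# CloudTrail keys embed the date: .../CloudTrail/<region>/YYYY/MM/DD/<file>.json.gz
DATE_IN_KEY = re.compile(r"/CloudTrail/[^/]+/(\d{4})/(\d{2})/(\d{2})/")


def group_keys_by_day(keys):
    """Hypothetical helper: map 'YYYY-MM-DD' to the object keys for that day."""
    groups = defaultdict(list)
    for key in keys:
        match = DATE_IN_KEY.search(key)
        if match:  # keys without a recognizable date are skipped
            groups["-".join(match.groups())].append(key)
    return dict(groups)


def aggregate(raw_files):
    """Hypothetical helper: merge gzipped CloudTrail files into one gzipped blob.

    Each input is the raw bytes of a CloudTrail log file, which is gzipped JSON
    with a top-level "Records" array. The output holds one JSON record per line
    (an assumed output format, chosen because it is simple to query).
    """
    out = io.BytesIO()
    with gzip.GzipFile(fileobj=out, mode="wb") as gz:
        for raw in raw_files:
            records = json.loads(gzip.decompress(raw)).get("Records", [])
            for record in records:
                gz.write((json.dumps(record) + "\n").encode("utf-8"))
    return out.getvalue()
```

Grouping by day first keeps each invocation's work bounded: one call to `aggregate` handles one day's files, so a failure affects only that day's output and can be retried on its own.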

Host/Sponsor: Chariot Solutions

Chariot Solutions is a software development consultancy located in the Greater Philadelphia area. We provide small teams of experienced, multi-talented engineers who work closely with your development team. The result has been over two decades of successful projects in application development and data engineering. If you have a challenge, let's work together to find the right solution.

Agenda

5:45 Meet and Greet, pizza provided.
6:15 Presentation.

If you connect via Zoom, beware that audio might not be great, as I'll be using my laptop's built-in microphone. Also be aware that we'll be recording the presentation.

Greater Philadelphia AWS User Group
This is a hybrid event.
In Person
Chariot Solutions
515 Pennsylvania Ave · Fort Washington, PA
Online event
This event has passed