LightOn AI Meetup: Creating a Large Dataset for Pretraining LLMs


Details
The LightOn AI Meetup is dedicated to discussing the latest advancements in the field of large language models. At this meetup, we will have the opportunity to learn from and network with some of the leading researchers and practitioners in the field. Whether you are a seasoned LLM researcher or just curious about the field, we hope you will join us for this exciting meetup!
This event takes place at 16:00, Paris time (UTC+1).
***
Agenda
16:00 – Introduction by LightOn
***
16:05 – Creating a large dataset for pretraining LLMs
by Guilherme Penedo, ML Research Engineer at HuggingFace 🤗
Abstract: Large language models have been proliferating at an impressive rate and can achieve human-like performance on a large number of tasks. But how exactly does one go about creating an LLM pretraining dataset? We will go through the different steps involved, look at the approaches taken by some recent dataset papers (RefinedWeb, Dolma, Yi), and explore open-source tools (datatrove) that make processing large amounts of text data accessible and easy to scale.
Bio: With a background in Aerospace Engineering, Guilherme first entered the ML world as an intern and later a full-time employee at LightOn. He was part of the Falcon team, where he was in charge of creating the pretraining dataset for the Falcon LLM: the RefinedWeb dataset. After the Falcon project, he joined HuggingFace, where he currently maintains the open-source data processing library `datatrove` and works on improving pretraining datasets as a member of the HuggingFace Science Team.
The event will be hosted on Google Meet and recorded.