Brett Larsen | The Importance of High-Quality Data in Building Your LLMs


Details
Title: The Importance of High-Quality Data in Building Your LLMs: Lessons from DBRX
Abstract: Pretraining datasets for large language models (LLMs) have grown to trillions of tokens, composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. However, it’s expensive to understand the impact of these domain-specific datasets, since training at large FLOP scales is required to reveal significant changes on difficult and emergent benchmarks. Given this cost, how does one efficiently characterize new datasets and optimize the balance between the diversity of web scrapes and the information density of domain-specific data? In this talk, we’ll consider the three steps we take to answer these questions with customers. First, we start by identifying quality benchmarks to guide data decisions by measuring how these benchmarks scale across a series of increasingly advanced models. Second, we perform continued pretraining on individual datasets to quickly identify which subset of benchmarks are impacted; this also provides an inexpensive way to adapt models to new domains when combined with weight averaging. Finally, we consider the technique of upsampling domain-specific data in the final phase of pretraining. We show that domain upsampling both boosts performance on challenging metrics and provides a framework for further study of individual datasets by measuring how performance changes when they are removed during this last phase of training. This tool opens up the ability to study the impact of different pretraining datasets at scale, but at an order-of-magnitude lower cost compared to full pretraining runs.
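To make the domain-upsampling idea concrete, here is a minimal sketch of how sampling weights over a data mixture might be boosted for domain-specific datasets during the final phase of training. The dataset names, base weights, and boost factor are illustrative assumptions, not details from the talk or the DBRX training recipe.

```python
# Toy sketch of domain upsampling (illustrative only, not Databricks' code):
# multiply the sampling weight of chosen domain-specific datasets by a boost
# factor during the final phase of pretraining, then renormalize.

def mixture_weights(base_weights, upsampled, boost, final_phase):
    """Return normalized per-dataset sampling probabilities.

    base_weights: dict mapping dataset name -> weight used for most of training
    upsampled:    set of domain-specific datasets to boost
    boost:        multiplicative factor applied in the final phase
    final_phase:  True once training enters its last stretch of tokens
    """
    raw = {
        name: w * (boost if final_phase and name in upsampled else 1.0)
        for name, w in base_weights.items()
    }
    total = sum(raw.values())
    return {name: w / total for name, w in raw.items()}

# Hypothetical mixture: mostly web scrape plus two small domain datasets.
base = {"common_crawl": 0.85, "code": 0.10, "math": 0.05}
early = mixture_weights(base, {"code", "math"}, boost=4.0, final_phase=False)
late = mixture_weights(base, {"code", "math"}, boost=4.0, final_phase=True)
```

Removing one dataset from the upsampled set and rerunning only this final phase is what makes the per-dataset ablations in the abstract cheap relative to a full pretraining run.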
Bio: Brett is a senior research scientist at Databricks Mosaic Research and a guest researcher at the Flatiron Institute. Prior to this, he was a research fellow at the Flatiron Institute’s Centers for Computational Mathematics and Neuroscience and completed his PhD at Stanford University co-advised by Surya Ganguli and Shaul Druckmann. Brett’s research sits at the intersection of data and AI, empirically studying how neural networks learn with the goal of making it more efficient to train modern generative AI models.
Agenda:
- 18:25: Virtual doors open
- 18:30: Talk
- 19:10: Q&A session
- 19:30: Close
Sponsor: Evolution AI - Generative AI data extraction from financial documents.
