[AI Alliance] GneissWeb: Preparing High Quality Data for LLMs at Scale


Details
IBM recently released GneissWeb, a large dataset of around 10 trillion tokens that meets the data quality and quantity requirements of training Large Language Models. In this talk I will do a deep dive into the philosophy behind the dataset, where it stands relative to other available datasets, how to recreate it using the tools IBM has open-sourced, and some performance figures obtained with it. This talk follows up on the one given by Shahrokh Daijavad of IBM in March.

Prerequisites
This is a follow-up to our March 6, 2025 session “Introducing GneissWeb - a state-of-the-art LLM pre-training dataset”.

About the presenter
Bishwaranjan Bhattacharjee, Senior Technical Staff Member and Master Inventor, IBM Research

About the AI Alliance
The AI Alliance is an international community of researchers, developers and organizational leaders committed to supporting and enhancing open innovation across the AI technology landscape to accelerate progress, improve safety, security and trust in AI, and maximize benefits to people and society everywhere. Members of the AI Alliance believe that open innovation is essential to developing and achieving safe and responsible AI that benefits society rather than a select few big players.

Data, Cloud and AI in Madrid
FREE