
[AI Alliance] Introducing GneissWeb: A State-of-the-Art LLM Pre-training Dataset

Network event
491 attendees from 109 groups hosting
Hosted By
IBM C.

Details

Agenda

  • Quick intro about AI Alliance (5 mins)
  • GneissWeb presentation (40 mins)
  • Q&A (10 mins)
  • Wrapup

Session: Introducing GneissWeb - a state-of-the-art LLM pre-training dataset
At IBM, responsible AI implies transparency in training data: Introducing GneissWeb (pronounced “niceWeb”), a state-of-the-art LLM pre-training dataset with ~10 trillion tokens derived from FineWeb, with open recipes, results, and tools for reproduction!

In this session we will go over how we created GneissWeb and discuss tools and techniques used. We will provide code examples that you can try at your leisure.

👉 More than 2% average improvement in benchmark performance over FineWeb
👉 Hugging Face page
👉 Data Prep Kit detailed recipe
👉 Data Prep Kit Bloom filter for quick reproduction
👉 Recipe models for reproduction
👉 Announcement
👉 Paper
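
For readers unfamiliar with the term, the sketch below illustrates the general idea behind using a Bloom filter to reproduce a filtered dataset: the IDs of the retained documents are added to a compact bit array, and a source corpus is then scanned, keeping only documents whose IDs test as present. This is a minimal illustration under stated assumptions, not the Data Prep Kit's implementation. The class `BloomFilter`, the function `filter_documents`, the `"id"` field, and the sizing parameters are all hypothetical names chosen for this example.

```python
import hashlib
from typing import Dict, Iterable, Iterator, List


class BloomFilter:
    """Tiny illustrative Bloom filter (hypothetical; not the Data Prep Kit code)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True for every added item; may rarely be True for items never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def filter_documents(docs: Iterable[Dict[str, str]], kept_ids: BloomFilter) -> List[Dict[str, str]]:
    """Keep only documents whose (assumed) "id" field is present in the filter."""
    return [doc for doc in docs if doc["id"] in kept_ids]


if __name__ == "__main__":
    # Made-up IDs purely for illustration.
    kept = BloomFilter(num_bits=1024, num_hashes=3)
    for doc_id in ("doc-a", "doc-b"):
        kept.add(doc_id)

    corpus = [
        {"id": "doc-a", "text": "kept"},
        {"id": "doc-z", "text": "very likely dropped"},
        {"id": "doc-b", "text": "kept"},
    ]
    print([doc["id"] for doc in filter_documents(corpus, kept)])
```

A Bloom filter never reports an added ID as missing, but it can occasionally report an ID that was never added as present, so a reproduction built this way may include a small number of extra documents depending on how the filter is sized.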

Session Type
Presentation

Audience
LLM app developers, data scientists, data engineers

Technical Level
Beginner – Intermediate

Prerequisites
None

Speaker: Shahrokh Daijavad, Research Scientist @ IBM Almaden Research Center
Shahrokh Daijavad, a distinguished Research Scientist in the Watsonx Data Engineering group at IBM Almaden Research Center, has a rich background in Edge Computing and Data Engineering. He earned his B.Eng. and Ph.D. in electrical engineering from McMaster University and spent years at IBM T. J. Watson Research Center. His recent research focuses on AI@Edge and Data Engineering for IBM Watsonx AI offerings.

About the AI Alliance
The AI Alliance is an international community of researchers, developers, and organizational leaders committed to supporting and enhancing open innovation across the AI technology landscape in order to accelerate progress, improve safety, security, and trust in AI, and maximize benefits to people and society everywhere. Members of the AI Alliance believe that open innovation is essential to developing safe and responsible AI that benefits society as a whole rather than a select few big players.

Data, Cloud and AI in Tel Aviv-Yafo