[AI Alliance] GneissWeb: Preparing High Quality Data for LLMs at Scale

Network event
182 attendees from 111 hosting groups

Details
IBM recently released GneissWeb, a large dataset yielding around 10 trillion tokens that caters to the data quality and quantity requirements of training Large Language Models. In this talk, I will take a deep dive into the philosophy behind this dataset, where it stands relative to other available datasets, how to recreate it using the tools IBM has open sourced, and some performance figures obtained with it. This talk is a follow-up to the talk given by Shahrokh Daijavad of IBM in March.

Prerequisites
This is a follow-up to our March 6, 2025 session “Introducing GneissWeb - a state-of-the-art LLM pre-training dataset”.

About the presenter
Bishwaranjan Bhattacharjee (LinkedIn), Senior Technical Staff Member and Master Inventor, IBM Research

About the AI Alliance
The AI Alliance is an international community of researchers, developers, and organizational leaders committed to supporting and enhancing open innovation across the AI technology landscape to accelerate progress, improve safety, security, and trust in AI, and maximize benefits to people and society everywhere. Members of the AI Alliance believe that open innovation is essential to developing and achieving safe and responsible AI that benefits society as a whole rather than a select few big players.

IBM Developer Meetup
Online event
Link visible to attendees
Free