Tokenizer-free language model inference

Details
A team from Aleph Alpha will present a talk on tokenizer-free language model inference.
Abstract:
Traditional Large Language Models rely heavily on large, predefined tokenizers (e.g., 128k+ vocabularies), introducing limitations in handling diverse character sets, rare words, and dynamic linguistic structures. This talk presents a different approach to language model inference that eliminates the need for conventional large-vocabulary tokenizers. The system operates with a core vocabulary of only 256 byte values, processing text at the most fundamental level. It employs a three-part architecture: byte-level encoder and decoder models handle character sequence processing, while a larger latent transformer operates on higher-level representations. The interface between these stages involves dynamically creating "patch embeddings", guided by word boundaries or entropy measures. This talk will first introduce the intricacies of this byte-to-patch transformer architecture. Subsequently, we will focus on the significant engineering challenges encountered in building an efficient inference pipeline, specifically coordinating the three models, managing their CUDA graphs, and handling their respective KV caches.
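For attendees who want a feel for the entropy-based patching mentioned in the abstract, here is a minimal sketch of the general idea (plain Python/NumPy, with a made-up threshold and hypothetical function names, not the speakers' code): a small byte-level model scores every position, and a new patch starts wherever the next-byte entropy spikes, i.e., wherever the model is "surprised".

```python
import numpy as np

def next_byte_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy (in nats) of the next-byte distribution at each position.

    logits: (seq_len, 256) scores from a small byte-level model.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

def entropy_patch_boundaries(logits: np.ndarray, threshold: float = 2.0) -> list[int]:
    """Return the start index of each patch.

    `threshold` is a hypothetical hyperparameter chosen for illustration,
    not a value from the talk.
    """
    entropy = next_byte_entropy(logits)
    boundaries = [0]
    for i in range(1, len(entropy)):
        if entropy[i] > threshold:  # model is uncertain -> start a new patch
            boundaries.append(i)
    return boundaries

# Toy usage: random logits standing in for a trained byte model's output.
rng = np.random.default_rng(0)
fake_logits = rng.normal(size=(16, 256)) * 4.0
print(entropy_patch_boundaries(fake_logits))
```

Word-boundary patching, the other strategy the abstract mentions, would simply replace the entropy test with a check for whitespace or similar delimiters.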
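The coordination problem on the inference side can likewise be sketched at a very high level. The loop below is a hypothetical illustration (all names and stand-in models are invented) of how three models with separate KV caches might interact during decoding; it says nothing about how the actual pipeline, its CUDA graphs, or its caches are implemented.

```python
from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Stand-in for one model's attention KV cache (here just a list)."""
    entries: list = field(default_factory=list)

    def append(self, state) -> None:
        self.entries.append(state)

def generate(prompt: bytes, encoder, latent, decoder, max_patches: int = 4) -> bytes:
    """Hypothetical three-stage decode loop, one patch per iteration.

    Each of the three models owns its own KV cache; in a real pipeline each
    stage would additionally be captured as its own CUDA graph.
    """
    enc_cache, lat_cache, dec_cache = KVCache(), KVCache(), KVCache()
    out = bytearray(prompt)
    patch = encoder(out, enc_cache)                   # bytes -> patch embedding
    for _ in range(max_patches):
        latent_state = latent(patch, lat_cache)       # predict next-patch state
        new_bytes = decoder(latent_state, dec_cache)  # state -> byte sequence
        if not new_bytes:                             # empty output = stop
            break
        out.extend(new_bytes)
        patch = encoder(new_bytes, enc_cache)         # feed new bytes back in
    return bytes(out)

# Toy stand-ins so the loop runs end to end; real models replace these.
def toy_encoder(bs, cache):
    cache.append(len(bs))
    return sum(bs) % 256          # fake scalar "patch embedding"

def toy_latent(patch, cache):
    cache.append(patch)
    return (patch + 1) % 256

def toy_decoder(state, cache):
    cache.append(state)
    return bytes([state])         # emit one byte per patch

print(generate(b"hi", toy_encoder, toy_latent, toy_decoder))
```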
🔈 Speakers:
- Pablo Iyu Guerrero and Lukas Blübaum
Agenda:
✨ 18:30 Doors open: time for networking with fellow attendees
✨ 19:00 Talk and Q&A
✨ 20:00 Mingling and networking with pizza and drinks
✨ 21:00 Meetup ends
- Where: In person, Aleph Alpha Heidelberg, Speyerer Str. 14
- When: Tuesday, May 20th
- Language: English
