🤖 Efficient Transformers: Training & Inference on a Budget
Details
Welcome to the Madrid ML Meetup 🤗. If you want to train or serve LLMs without unlimited GPUs, this meetup is what you're looking for!
In this meetup, we will distill the current state of the art in practical LLM efficiency, compare approaches, share configurations, and focus on what actually drives improvements in throughput, memory, and cost. 📈💰
What we will cover (with paper highlights)
- Approximations of Attention 🔍
  Linear Attention, Linformer, and Performer; RoPE for stable positional encoding; GQA to shrink the KV cache (sketched in the code after this list); SGLang for efficient structured LM programs.
- Selective State Space Models (SSMs) 🧭
  Mamba and Mamba-2 for linear-time sequence modeling, and when SSMs shine for long context.
- Hardware-based Techniques ⚙️
  FlashAttention v1 and v2 for IO awareness and improved parallelism. FP8 and mixed precision, fused optimizers, gradient checkpointing, and how to stack them safely.
- Mixture of Experts (MoE) 🧠
  DeepSeekMoE for specialization and MetaShuffling to accelerate Llama-style MoE inference.
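To ground one of the items above, here is a minimal PyTorch sketch of grouped-query attention (GQA) and why it shrinks the KV cache. The head counts and tensor sizes are illustrative assumptions, not taken from any particular model or paper.

```python
import torch
import torch.nn.functional as F

# Toy sizes for illustration only: 32 query heads share 8 KV heads (a 4:1 grouping).
batch, seq_len, head_dim = 1, 1024, 128
n_q_heads, n_kv_heads = 32, 8

q = torch.randn(batch, n_q_heads, seq_len, head_dim)
k = torch.randn(batch, n_kv_heads, seq_len, head_dim)  # only these are cached at inference
v = torch.randn(batch, n_kv_heads, seq_len, head_dim)  # only these are cached at inference

# Each KV head serves a group of query heads: expand K/V, then run standard attention.
group = n_q_heads // n_kv_heads
out = F.scaled_dot_product_attention(
    q,
    k.repeat_interleave(group, dim=1),
    v.repeat_interleave(group, dim=1),
    is_causal=True,
)

# The KV cache stores the un-expanded k and v: 8 heads instead of 32,
# so this layer's cache is ~4x smaller than with full multi-head attention.
kv_mib = 2 * k.numel() * k.element_size() / 2**20
print(f"KV cache, this layer: {kv_mib:.0f} MiB (GQA) vs {kv_mib * group:.0f} MiB (MHA)")
```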
Format
Talk plus paper spotlights, open Q&A focused on trade-offs, and very friendly networking. 🎤🍻
You will leave with
- A mental map of the efficiency landscape from attention to SSMs to kernels to MoE. 🗺️
- Concrete tips for cutting memory and latency, including KV cache tactics, precision choices, PEFT, and batching (a small configuration sketch follows this list). ✂️🧮
- Reference settings and heuristics to try on your own models. 🧰
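As a taste of the training-side stacking, here is a minimal sketch that combines bf16 mixed precision, gradient checkpointing, and LoRA-style PEFT with Hugging Face transformers and peft. The model name and hyperparameters are placeholders, not a recommended recipe.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",        # placeholder: any causal LM checkpoint
    torch_dtype=torch.bfloat16,   # mixed precision: keep weights/activations in bf16
)
model.gradient_checkpointing_enable()  # trade recompute for activation memory

lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: Llama-style attention module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```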
Who should attend
ML engineers, infra folks, researchers, and product-minded builders. Intermediate or higher PyTorch and Transformers experience helps, but curiosity is enough. 🤝
This event is organized by Machine Learning Circle, Universidad Politécnica de Madrid, Universidad Rey Juan Carlos, Apolo AI, and ARQUIMEA Research Center in collaboration with the Madrid City Council. 🏦
Logistics
🗣️ Language: English / Spanish
🍻 Networking: Drinks and snacks after the session