
LightOn AI Meetup: Sparsity for Efficient Long Sequence Generation of LLMs

Hosted By
Iacopo P. and Igor C.

Details

The LightOn AI Meetup is dedicated to discussing the latest advancements in the field of large language models. At this meetup, we will have the opportunity to learn from and network with some of the leading researchers and practitioners in the field. Whether you are a seasoned LLM researcher or just curious about the field, we hope you will join us for this exciting meetup!

This event takes place at 16:00, Paris time (UTC+1).

***

Agenda
16:00 – Introduction by LightOn

***

16:05 – Sparsity for Efficient Long Sequence Generation of LLMs
by Beidi Chen, Assistant Professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University, and Visiting Research Scientist at FAIR, Meta

Abstract: Large language models (LLMs) have sparked a new wave of exciting AI applications, but they are computationally expensive at inference time. Sparsity is a natural approach to reduce this cost, but existing methods either require costly retraining, have to forgo LLMs' in-context learning ability, or do not yield wall-clock speedups on modern hardware. In this talk, I will show how sparsity can help overcome two major bottlenecks in LLM inference, model weight and KV cache IO, and unlock the possibility of handling infinitely long sequences.

First, we present Heavy-Hitter Oracle (H2O), a KV cache eviction policy that drastically reduces the memory footprint of these transient states. Our approach is based on the observation that a small portion of tokens, the heavy hitters, contributes most of the value when computing attention scores. H2O improves throughput over three leading inference systems (DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen) by up to 29x, 29x, and 3x respectively on OPT-6.7B and OPT-30B. With the same batch size, H2O can reduce latency by up to 1.9x.
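To make the eviction idea concrete, here is a minimal sketch of an H2O-style cache step in Python. It is illustrative only; the function name, shapes, and the single-head framing are assumptions, not the paper's implementation, but it shows the core rule: keep the tokens with the largest accumulated attention plus the most recent ones, and drop everything else.

```python
import numpy as np

def h2o_evict(keys, values, acc_attn, num_heavy, num_recent):
    """Illustrative H2O-style KV cache eviction for one head (not the official code).

    keys, values: (seq_len, d) cached key/value vectors
    acc_attn:     (seq_len,) attention mass each cached token has accumulated so far
    Keeps the `num_heavy` highest-scoring older tokens ("heavy hitters")
    plus the `num_recent` most recent tokens; evicts the rest.
    """
    seq_len = keys.shape[0]
    recent = set(range(max(0, seq_len - num_recent), seq_len))
    older = [i for i in range(seq_len) if i not in recent]
    # Heavy hitters: older tokens that have received the most attention so far.
    heavy = sorted(older, key=lambda i: acc_attn[i], reverse=True)[:num_heavy]
    keep = sorted(recent | set(heavy))
    return keys[keep], values[keep], acc_attn[keep]
```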

Then we present StreamingLLM, a simplification of H2O based on a further finding about heavy hitters called the attention sink: keeping only the KV of the initial tokens largely recovers LLM performance. It enables LLMs trained with a finite-length attention window to generalize to infinite sequence length without any fine-tuning. Specifically, StreamingLLM enables Llama-2, MPT, Falcon, and Pythia to perform stable and efficient language modeling with up to 4 million tokens and more.
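The cache policy behind this is even simpler than H2O's, as the following sketch suggests. Again this is a hedged illustration with assumed parameter names (`num_sinks`, `window`), not StreamingLLM's API: the first few "attention sink" tokens are pinned, and everything else is a rolling window of recent tokens, so the cache stays bounded regardless of how long generation runs.

```python
import numpy as np

def streaming_cache(keys, values, num_sinks=4, window=2048):
    """Illustrative StreamingLLM-style cache selection (not the official code).

    Keeps the KV of the first `num_sinks` tokens (attention sinks) plus the
    most recent `window` tokens, giving a constant-size cache for infinite-length
    generation.
    """
    seq_len = keys.shape[0]
    if seq_len <= num_sinks + window:
        return keys, values
    keep = list(range(num_sinks)) + list(range(seq_len - window, seq_len))
    return keys[keep], values[keep]
```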

Finally, we present Dejavu, a system that uses a low-cost algorithm to predict contextual sparsity on the fly given the inputs to each layer, along with an asynchronous and hardware-aware implementation that reduces model weight loading IO. Dejavu can reduce the inference latency of OPT-175B by over 2x compared to the state-of-the-art FasterTransformer, and by over 6x compared to the widely used Hugging Face implementation, without compromising model quality.
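The sketch below illustrates the contextual-sparsity idea for a single MLP block. All names and shapes here are assumptions for the purpose of illustration (a cheap linear predictor standing in for Dejavu's learned predictors), not Dejavu's actual implementation: a low-cost predictor scores which hidden neurons are likely to activate for the current input, and only those rows and columns of the weight matrices are then loaded and computed.

```python
import numpy as np

def contextual_sparse_mlp(x, W1, W2, predictor_W, top_k):
    """Illustrative Dejavu-style contextual sparsity for one MLP block
    (hypothetical names/shapes, not Dejavu's API).

    x:           (d,) layer input
    W1:          (d, h) up-projection, W2: (h, d) down-projection
    predictor_W: (d, h) cheap predictor of per-neuron importance for this input
    top_k:       number of hidden neurons predicted to be active
    """
    scores = x @ predictor_W                 # low-cost importance prediction
    active = np.argsort(-scores)[:top_k]     # neurons predicted to fire
    h = np.maximum(x @ W1[:, active], 0.0)   # compute only the active subset
    return h @ W2[active, :]                 # down-project from active neurons only
```

In the real system the saving comes from skipping the weight-loading IO for inactive neurons and heads, which is why the hardware-aware, asynchronous implementation matters as much as the predictor itself.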

Bio: Beidi Chen is an Assistant Professor in the Department of Electrical and Computer Engineering at Carnegie Mellon University. She is a Visiting Research Scientist at FAIR, Meta. Before that, she was a postdoctoral scholar at Stanford University. She received her Ph.D. from Rice University in 2020 and B.S. from UC Berkeley in 2015. Her research focuses on efficient machine learning. Specifically, she designs and optimizes algorithms and models on modern hardware to accelerate large machine learning systems. Her work has won a best paper runner-up at ICML 2022, a best paper award at IISA 2018, and a best paper award at USENIX LISA 2014. She was selected as a Rising Star in EECS by MIT in 2019 and UIUC in 2021.
