Inference: Disaggregated Prefill-Decode Serving, RadixAttention + NVIDIA GTC 2026 Conference Recap
Details
Zoom link: https://us02web.zoom.us/j/82308186562
Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth
Talk #1: Inference Engines Deep Dive: Disaggregated Serving, PagedAttention, RadixAttention, and the Modern LLM Serving Stack by Seth Weidman @ SentiLink, author of Deep Learning from Scratch (O'Reilly)
In this talk, Seth will go deep on how modern inference engines work end-to-end: disaggregated serving architectures, PagedAttention and RadixAttention, and how NVIDIA Dynamo interfaces with vLLM, SGLang, and TensorRT-LLM. He’ll also cover emerging KV-cache optimization/compression directions and practical tradeoffs for throughput, latency, and memory efficiency in production LLM systems.
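For attendees new to these ideas, here is a minimal, illustrative Python sketch (not from the talk) of the prefix-reuse idea behind RadixAttention-style KV caching: requests that share a token prefix, such as a common system prompt, can reuse the cached KV entries for that prefix and only prefill the new tokens. All class and variable names below are hypothetical, and a per-token trie stands in for the compressed radix tree used in real implementations.

# Illustrative sketch only: prefix-sharing KV cache lookup.
# A toy per-token trie (not a true compressed radix tree); kv entries are
# stand-in strings rather than real attention key/value tensors.

class PrefixCacheNode:
    def __init__(self):
        self.children = {}   # token id -> child node
        self.kv_block = None # cached KV entry for this position (placeholder)

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens, kv_blocks):
        # Store one KV entry per token position along the path.
        node = self.root
        for tok, kv in zip(tokens, kv_blocks):
            node = node.children.setdefault(tok, PrefixCacheNode())
            node.kv_block = kv

    def longest_prefix(self, tokens):
        # Return (matched length, cached KV entries) for the longest shared prefix.
        node, matched = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched.append(node.kv_block)
        return len(matched), matched

if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = [1, 2, 3, 4]                 # shared prefix tokens
    cache.insert(system_prompt, ["kv0", "kv1", "kv2", "kv3"])

    new_request = [1, 2, 3, 4, 9, 10]            # same prefix + new user turn
    n, reused = cache.longest_prefix(new_request)
    print(f"reused {n} cached positions; only {len(new_request) - n} need prefill")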
Talk #2: NVIDIA GTC 2026 AI Conference Recap by Chris Fregly, author of AI Systems Performance Engineering (O'Reilly)
In this talk, Chris will present the AI and systems highlights from the NVIDIA GTC 2026 conference (held the prior week).
NVIDIA GTC Conference registration link:
https://www.nvidia.com/gtc/ (Use code GTC26-20 for 20% off!)
Related Links
GitHub Repo: http://github.com/cfregly/ai-performance-engineering/
O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
YouTube: https://www.youtube.com/@AIPerformanceEngineering
Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm
