Inference: Disaggregated Prefill-Decode Serving, RadixAttention + NVIDIA GTC 2026 Conference Recap
Details
Zoom link: https://us02web.zoom.us/j/82308186562
Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth
Talk #1: Inference Engines Deep Dive: Disaggregated Serving, PagedAttention, RadixAttention, and the Modern LLM Serving Stack by Seth Weidman @ SentiLink, author of Deep Learning from Scratch (O'Reilly)
In this talk, Seth will go deep on how modern inference engines work end-to-end: disaggregated serving architectures, PagedAttention and RadixAttention, and how NVIDIA Dynamo interfaces with vLLM, SGLang, and TensorRT-LLM. He’ll also cover emerging KV-cache optimization/compression directions and practical tradeoffs for throughput, latency, and memory efficiency in production LLM systems.
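For attendees new to these ideas, here is a minimal, illustrative Python sketch (not from the talk) of the prefix-reuse idea behind RadixAttention-style KV caching: requests that share a token prefix, such as a common system prompt, can reuse the cached KV entries for that prefix and only prefill the new tokens. All class and variable names below are hypothetical, and a per-token trie stands in for the compressed radix tree used in real implementations.

# Illustrative sketch only: prefix-sharing KV cache lookup.
# A toy per-token trie (not a true compressed radix tree); kv entries are
# stand-in strings rather than real attention key/value tensors.

class PrefixCacheNode:
    def __init__(self):
        self.children = {}   # token id -> child node
        self.kv_block = None # cached KV entry for this position (placeholder)

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def insert(self, tokens, kv_blocks):
        # Store one KV entry per token position along the path.
        node = self.root
        for tok, kv in zip(tokens, kv_blocks):
            node = node.children.setdefault(tok, PrefixCacheNode())
            node.kv_block = kv

    def longest_prefix(self, tokens):
        # Return (matched length, cached KV entries) for the longest shared prefix.
        node, matched = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            matched.append(node.kv_block)
        return len(matched), matched

if __name__ == "__main__":
    cache = PrefixCache()
    system_prompt = [1, 2, 3, 4]                 # shared prefix tokens
    cache.insert(system_prompt, ["kv0", "kv1", "kv2", "kv3"])

    new_request = [1, 2, 3, 4, 9, 10]            # same prefix + new user turn
    n, reused = cache.longest_prefix(new_request)
    print(f"reused {n} cached positions; only {len(new_request) - n} need prefill")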
Talk #2: NVIDIA GTC 2026 AI Conference Recap by Chris Fregly, author of AI Systems Performance Engineering (O'Reilly)
In this talk, Chris will present the AI and systems highlights from the NVIDIA GTC 2026 conference (held the prior week).
NVIDIA GTC Conference registration link:
https://www.nvidia.com/gtc/ (Use code GTC26-20 for 20% off!)
Related Links
GitHub Repo: http://github.com/cfregly/ai-performance-engineering/
O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
YouTube: https://www.youtube.com/@AIPerformanceEngineering
Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm
