Details

Zoom link: https://us02web.zoom.us/j/82308186562

Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth

Talk #1: Inference Engines Deep Dive: Disaggregated Serving, PagedAttention, RadixAttention, and the Modern LLM Serving Stack by Seth Weidman @ SentiLink and author of Deep Learning from Scratch (O’Reilly)

In this talk, Seth will go deep on how modern inference engines work end-to-end: disaggregated serving architectures, PagedAttention and RadixAttention, and how NVIDIA Dynamo interfaces with vLLM, SGLang, and TensorRT-LLM. He’ll also cover emerging KV-cache optimization/compression directions and practical tradeoffs for throughput, latency, and memory efficiency in production LLM systems.
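To give a flavor of the PagedAttention idea the talk covers, here is a minimal sketch of the block-table bookkeeping behind paged KV-cache allocation. The names (`BlockTable`, `append_token`, `BLOCK_SIZE`) are hypothetical illustrations, not vLLM's actual API, and the block size is arbitrary.

```python
# Minimal sketch of the block-table idea behind a paged KV cache.
# Illustrative only; names are hypothetical, not vLLM's real API.

BLOCK_SIZE = 4  # tokens per KV block (illustrative; vLLM commonly uses 16)

class BlockTable:
    """Maps a sequence's logical token positions to physical KV blocks."""
    def __init__(self, pool):
        self.pool = pool      # shared free list of physical block ids
        self.blocks = []      # this sequence's physical block ids, in order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when the last one is full,
        # so memory grows in BLOCK_SIZE-token increments instead of
        # reserving the whole max context length up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.blocks.append(self.pool.pop())
        self.num_tokens += 1

pool = list(range(100))   # 100 free physical blocks shared across sequences
seq = BlockTable(pool)
for _ in range(9):        # generate 9 tokens
    seq.append_token()

print(len(seq.blocks))    # 9 tokens at 4 per block -> 3 physical blocks
```

The point of the indirection is that physical blocks need not be contiguous, which eliminates the fragmentation of reserving one large contiguous KV buffer per request.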

Talk #2: NVIDIA GTC 2026 AI Conference Recap by Chris Fregly, author of AI Systems Performance Engineering (O'Reilly)

In this talk, Chris will present the AI and systems highlights from the NVIDIA GTC 2026 conference (held the prior week).

NVIDIA GTC Conference registration link:
https://www.nvidia.com/gtc/ (Use code GTC26-20 for 20% off!)

Related Links

GitHub Repo: http://github.com/cfregly/ai-performance-engineering/

O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/

YouTube: https://www.youtube.com/@AIPerformanceEngineering

Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm
