
Details

PyTorch ATX is joining forces with the vLLM community on September 17th for a hands-on look at the next generation of AI inference pipelines. We'll explore the full modern stack, from aggressive model-size reductions such as INT4/INT8 quantization and pruning to dynamic batching, paged-attention memory management, and multi-node scheduling. We'll dive into vLLM, today's most popular open-source engine for high-throughput LLM inference, and then learn how to deploy at larger scale using the llm-d project.
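To give a flavor of one technique on the agenda: symmetric INT8 quantization maps floating-point weights onto 8-bit integers plus a single scale factor. The sketch below is purely illustrative plain Python, not vLLM's actual implementation (which relies on optimized kernels); the function names are our own.

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization (illustrative sketch only)."""
    amax = max(abs(v) for v in values)
    # Map the range [-amax, amax] onto integer codes in [-127, 127].
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float values from the INT8 codes."""
    return [x * scale for x in q]

weights = [0.6, -1.0, 0.3, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# Reconstruction error per value is bounded by half a quantization step (scale / 2).
```

Storing 8-bit codes instead of 32-bit floats cuts weight memory roughly 4x, which is one reason quantization features so prominently in inference stacks like the one this meetup covers.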

Presenters include:

  • “Getting started with inference using vLLM” - Steve Watt, PyTorch ambassador
  • “An intermediate guide to inference using vLLM - PagedAttention, Quantization, Speculative Decoding, Continuous Batching, and more” - Luka Govedič, vLLM core committer
  • “vLLM Semantic Router - Intelligent Auto Reasoning Router for Efficient LLM Inference on Mixture-of-Models” - Huamin Chen, vLLM Semantic Router project creator
  • “Combining Kubernetes and vLLM to deliver scalable, distributed inference with llm-d” - Greg Pereira, llm-d maintainer

Expect deeply technical talks, live demos, and open Q&A with the engineers building and running these systems.

When: September 17, 2025 - 5:30PM to 8:30PM
Where: Voltron Room - Capital Factory (1st Floor of Omni Hotel) in Austin, TX

Light food and beverages will be provided.

Email pytorchatx@gmail.com with any questions.
