High-Performance AI Agent Inference Optimizations + vLLM vs. SGLang vs. TensorRT


Details
Zoom link: https://us02web.zoom.us/j/82308186562
Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth
Talk #1: LLM Engineers Almanac + GPU Glossary + Inference Benchmarks for vLLM, SGLang, and TensorRT + Inference Optimizations by Charles Frye @ Modal
Just as applications rely on SQL engines to store and query structured data, modern LLM deployments need “LLM engines” to manage weight caches, batch scheduling, and hardware-accelerated matrix operations. A recent survey of 25 open-source and commercial inference engines highlights rapid gains in usability and performance, demonstrating that the software stack now meets the baseline quality for cost-effective, self-hosted LLM inference (arxiv.org). Tools like Modal’s LLM Engine Advisor further streamline adoption by benchmarking throughput and latency across configurations, offering engineers ready-to-use code snippets for deployment on serverless cloud infrastructure.
https://modal.com/llm-almanac/advisor
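To give a flavor of the kind of deployment snippet the Advisor produces, here is a minimal sketch of running vLLM as a serverless Modal function. The app name, GPU type, model, and sampling settings below are illustrative assumptions, not taken from the talk or the Advisor itself:

```python
# Minimal sketch (not from the talk): a vLLM generate call wrapped as a Modal
# serverless function. App name, GPU type, and model are assumptions.
import modal

# Container image with vLLM installed; pin the version you actually benchmark.
image = modal.Image.debian_slim().pip_install("vllm")

app = modal.App("llm-engine-sketch", image=image)


@app.function(gpu="H100", timeout=600)
def generate(prompts: list[str]) -> list[str]:
    from vllm import LLM, SamplingParams  # imported inside the GPU container

    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed model
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(prompts, params)
    return [o.outputs[0].text for o in outputs]


@app.local_entrypoint()
def main():
    print(generate.remote(["Explain continuous batching in one sentence."]))
```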
Talk #2: High-Performance Agentic AI Inference Systems by Chris Fregly
High-performance LLM inference is critical for mass adoption of AI agents. In this talk, I will demonstrate how to capture the full capabilities of today’s GPU hardware using highly tuned inference engines such as vLLM and NVIDIA Dynamo to serve ultra-scale autonomous AI agents. Drawing on recent breakthroughs, I'll show how co-designing software with cutting-edge hardware addresses the scaling challenges of the inference environments that AI agents require. This talk is based on Chris's upcoming book, AI Systems Performance Engineering: Optimizing GPUs, CUDA, and PyTorch.
https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
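As a rough illustration of the agent-scale serving pattern the talk covers, the sketch below fires several concurrent "agent" requests at a vLLM OpenAI-compatible endpoint, letting the engine batch them on the GPU. It assumes a server is already running locally (e.g. `vllm serve <model>` on port 8000) and that the model name shown is available; both are assumptions for illustration only:

```python
# Minimal sketch: concurrent agent requests against a local vLLM server.
# Assumes `vllm serve <model>` is running on port 8000 and `openai` is installed.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")


async def agent_step(task: str) -> str:
    # One tool-free agent turn; vLLM batches concurrent requests on the GPU.
    resp = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model name
        messages=[{"role": "user", "content": task}],
        max_tokens=128,
    )
    return resp.choices[0].message.content


async def main():
    tasks = [f"Plan step {i} of the deployment." for i in range(8)]
    results = await asyncio.gather(*(agent_step(t) for t in tasks))
    for r in results:
        print(r)


asyncio.run(main())
```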
Related Links
GitHub Repo: http://github.com/cfregly/ai-performance-engineering/
O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
YouTube: https://www.youtube.com/@AIPerformanceEngineering
Free Generative AI Course on DeepLearning.AI: https://bit.ly/gllm


Every 3rd Tuesday of the month until October 20, 2025