Dynamic/Adaptive RL-based Inference Tuning + Accelerated PyTorch with Mojo/MAX


Details
Zoom link: https://us02web.zoom.us/j/82308186562
Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth
Talk #1: Building Accelerated PyTorch Operations with Mojo and the MAX runtime by Ehsan Kermani @ Modular (the Mojo folks)
Ehsan will dive deep into the Mojo interfaces that enable developers to write PyTorch custom ops directly in Mojo. He’ll walk through how the interfaces work, show examples such as a Mojo-accelerated deep learning model like Whisper, and explain how this opens the door to integrating MAX and Mojo into existing PyTorch workflows.
Talk #2: Dynamic and Adaptive AI Inference Serving Optimization Strategies with CUDA and vLLM by Chris Fregly, Author of AI Systems Performance Engineering
Ultra-large language model (LLM) inference on modern hardware requires dynamic runtime adaptation to achieve both high throughput and low latency under varying conditions. A static “one-size-fits-all” approach to model-serving optimizations is no longer sufficient.
Instead, state-of-the-art model-serving systems use adaptive strategies that adjust parallelism, numerical precision, CUDA-kernel scheduling, and memory usage on the fly. This talk explores these advanced techniques, including dynamic parallelism switching, precision scaling, real-time cache management, and reinforcement learning (RL)-based tuning.
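As a taste of RL-based tuning, here is a minimal sketch (plain Python, no serving stack required) of an epsilon-greedy bandit that picks a serving configuration, here a candidate max batch size, based on observed reward such as tokens/sec. The reward function is simulated and hypothetical; real systems would measure live throughput and latency against SLOs.

```python
import random

class EpsilonGreedyTuner:
    """Toy epsilon-greedy bandit over candidate serving configs.
    Conceptual sketch only; production tuners track many more knobs."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)                 # candidate batch sizes
        self.epsilon = epsilon
        self.counts = {a: 0 for a in arms}
        self.values = {a: 0.0 for a in arms}   # running mean reward per arm
        self.rng = random.Random(seed)

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)              # explore
        return max(self.arms, key=lambda a: self.values[a])  # exploit

    def update(self, arm, reward):
        self.counts[arm] += 1
        n = self.counts[arm]
        # Incremental running-mean update.
        self.values[arm] += (reward - self.values[arm]) / n

def simulated_throughput(batch_size, rng):
    # Invented reward: throughput rises with batch size, then falls off
    # once latency targets are blown (peak near 32 in this toy model).
    base = batch_size if batch_size <= 32 else 64 - batch_size
    return base + rng.gauss(0, 1)

rng = random.Random(42)
tuner = EpsilonGreedyTuner(arms=[8, 16, 32, 64], epsilon=0.2)
for _ in range(500):
    arm = tuner.select()
    tuner.update(arm, simulated_throughput(arm, rng))

best = max(tuner.arms, key=lambda a: tuner.values[a])
print("best batch size:", best)
```

This is the core loop the talk's "monitor your own performance and adapt" framing implies: measure, update a value estimate, and keep exploring occasionally so the policy can track changing conditions.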
By the end of this talk, you will understand best practices for ultra-scale LLM inference. You will learn how to orchestrate an inference engine that monitors its own performance and adapts in real time to maximize efficiency.
Related Links
GitHub Repo: http://github.com/cfregly/ai-performance-engineering/
O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
YouTube: https://www.youtube.com/@AIPerformanceEngineering
Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm


Every 3rd Tuesday of the month until October 20, 2025