
What we do
This meetup focuses on AI Performance Engineering, including GPUs, CUDA, PyTorch, TensorFlow, Kubernetes, optimizations, high-throughput training clusters, and low-latency inference clusters.
Upcoming events (4+)
- High-Performance AI Agent Inference Optimizations + vLLM vs. SGLang vs. TensorRT
Zoom link: https://us02web.zoom.us/j/82308186562
Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth

Talk #1: LLM Engineers Almanac + GPU Glossary + Inference Benchmarks for vLLM, SGLang, and TensorRT + Inference Optimizations by Charles Frye @ Modal
Just as applications rely on SQL engines to store and query structured data, modern LLM deployments need “LLM engines” to manage weight caches, batch scheduling, and hardware-accelerated matrix operations. A recent survey of 25 open-source and commercial inference engines (arxiv.org) highlights rapid gains in usability and performance, demonstrating that the software stack now meets the baseline quality for cost-effective, self-hosted LLM inference. Tools like Modal’s LLM Engine Advisor further streamline adoption by benchmarking throughput and latency across configurations, offering engineers ready-to-use code snippets for deployment on serverless cloud infrastructure.
https://modal.com/llm-almanac/advisor
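The benchmarking idea behind tools like the LLM Engine Advisor can be sketched in a few lines: time each request against an engine and derive mean latency and throughput. This is a minimal illustration, not Modal's actual code; `mock_generate` is a stand-in for a real engine client (e.g. an HTTP call to a vLLM or SGLang server).

```python
import time
import statistics

def benchmark_engine(generate, prompts, runs=3):
    """Measure mean per-request latency and derived throughput of an
    inference callable. `generate` stands in for a real engine client."""
    latencies = []
    for _ in range(runs):
        for prompt in prompts:
            start = time.perf_counter()
            generate(prompt)
            latencies.append(time.perf_counter() - start)
    mean_latency = statistics.mean(latencies)
    return {
        "mean_latency_s": mean_latency,
        "throughput_rps": 1.0 / mean_latency,  # requests/sec at this latency
    }

# Mock engine standing in for a real LLM backend.
def mock_generate(prompt):
    time.sleep(0.001)  # simulate ~1 ms of inference work
    return prompt[::-1]

results = benchmark_engine(mock_generate, ["hello", "world"])
```

A real harness would also sweep batch sizes and measure tokens/sec rather than requests/sec, but the measurement loop is the same shape.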
Talk #2: High-Performance Agentic AI Inference Systems by Chris Fregly
High-performance LLM inference is critical for mass adoption of AI agents. In this talk, I will demonstrate how to capture the full capabilities of today’s GPU hardware using highly tuned inference engines like vLLM and NVIDIA Dynamo for ultra-scale autonomous AI agents. Drawing on recent breakthroughs, I'll show how co-designing software with cutting-edge hardware can address the scaling challenges of the ultra-scale inference environments required by AI agents. This talk is drawn from Chris' upcoming book, AI Systems Performance Engineering: Optimizing GPUs, CUDA, and PyTorch.
https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
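A key throughput technique in engines like vLLM is continuous batching: new requests join the running batch as soon as a slot frees up, instead of waiting for the whole batch to drain. The toy scheduler below illustrates only the scheduling idea; the class and method names are invented for this sketch and are not a real vLLM or Dynamo API.

```python
from collections import deque

class ContinuousBatcher:
    """Toy continuous-batching scheduler: one decode step advances every
    running request by one token, retires finished requests, and backfills
    freed slots from the waiting queue."""

    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.queue = deque()   # waiting (request_id, tokens_remaining)
        self.running = {}      # request_id -> tokens_remaining

    def submit(self, request_id, num_tokens):
        self.queue.append((request_id, num_tokens))

    def step(self):
        """One decode step; returns the request ids that finished."""
        # Backfill free slots before decoding.
        while self.queue and len(self.running) < self.max_batch_size:
            rid, tokens = self.queue.popleft()
            self.running[rid] = tokens
        finished = []
        for rid in list(self.running):
            self.running[rid] -= 1            # decode one token
            if self.running[rid] == 0:
                finished.append(rid)
                del self.running[rid]
        return finished

batcher = ContinuousBatcher(max_batch_size=2)
batcher.submit("a", 1)
batcher.submit("b", 3)
batcher.submit("c", 1)
done = []
while batcher.running or batcher.queue:
    done.extend(batcher.step())
```

Note that "c" finishes before "b": the short request slips into the slot "a" freed, which is exactly how continuous batching keeps GPU utilization high under mixed-length agent workloads.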
Zoom link: https://us02web.zoom.us/j/82308186562
Related Links
GitHub Repo: http://github.com/cfregly/ai-performance-engineering/
O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
YouTube: https://www.youtube.com/@AIPerformanceEngineering
Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm
- Dynamic/Adaptive RL-based Inference Tuning + Accelerated PyTorch with Mojo/MAX
Zoom link: https://us02web.zoom.us/j/82308186562
Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth

Talk #1: Building Accelerated PyTorch Operations with Mojo and the MAX runtime by Ehsan Kermani @ Modular (the Mojo folks)
Ehsan will dive deep into the Mojo interfaces that enable developers to write PyTorch custom ops directly in Mojo. He’ll walk through how the interfaces work, show examples such as a Mojo-accelerated deep learning model like Whisper, and explain how this opens the door to integrating MAX and Mojo into existing PyTorch workflows.
Talk #2: Dynamic and Adaptive AI Inference Serving Optimization Strategies with CUDA and vLLM by Chris Fregly, Author of AI Systems Performance Engineering
Ultra-large language model (LLM) inference on modern hardware requires dynamic runtime adaptation to achieve both high throughput and low latency under varying conditions. A static “one-size-fits-all” approach to model-serving optimizations is no longer sufficient.
Instead, state-of-the-art model serving systems use adaptive strategies that adjust parallelism, numerical precision, CUDA-kernel scheduling, and memory usage on the fly. This talk explores these advanced techniques including dynamic parallelism switching, precision scaling, real-time cache management, and reinforcement learning (RL)-based tuning.
By the end of this talk, you will understand best practices for ultra-scale LLM inference. You will learn how to orchestrate an inference engine that monitors its own performance and adapts in real time to maximize efficiency.
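The adaptive loop described above (monitor performance, adjust precision or parallelism on the fly) can be sketched as a simple feedback controller. This is a minimal illustration under assumed thresholds; the class name, the latency window, and the `"bf16"`/`"fp8"` mode flags are all hypothetical stand-ins for real vLLM/CUDA runtime settings.

```python
class AdaptiveServingController:
    """Minimal sketch of runtime adaptation: watch a moving window of
    request latencies and trade numerical precision for speed whenever
    the latency budget is exceeded."""

    def __init__(self, latency_budget_s, window=8):
        self.latency_budget_s = latency_budget_s
        self.window = window
        self.samples = []
        self.precision = "bf16"  # start at higher precision / quality

    def observe(self, latency_s):
        self.samples.append(latency_s)
        self.samples = self.samples[-self.window:]  # keep a moving window
        self._adapt()

    def _adapt(self):
        avg = sum(self.samples) / len(self.samples)
        if avg > self.latency_budget_s and self.precision == "bf16":
            self.precision = "fp8"   # over budget: downshift to cheaper kernels
        elif avg < 0.5 * self.latency_budget_s and self.precision == "fp8":
            self.precision = "bf16"  # latency headroom: restore quality

ctrl = AdaptiveServingController(latency_budget_s=0.1)
for lat in [0.05, 0.09, 0.2, 0.25]:  # load spike pushes latency over budget
    ctrl.observe(lat)
```

An RL-based tuner generalizes this hand-written policy: instead of fixed thresholds, a learned policy picks the batch size, precision, and parallelism action that maximizes a throughput-vs-latency reward.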
Zoom link: https://us02web.zoom.us/j/82308186562
Related Links
GitHub Repo: http://github.com/cfregly/ai-performance-engineering/
O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
YouTube: https://www.youtube.com/@AIPerformanceEngineering
Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm
- GPU, CUDA, and PyTorch Performance Optimizations
Zoom link: https://us02web.zoom.us/j/82308186562
Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth

Talk #1: GPU, PyTorch, and CUDA Performance Optimizations
Talk #2: GPU, PyTorch, and CUDA Performance Optimizations
Zoom link: https://us02web.zoom.us/j/82308186562
Related Links
GitHub Repo: http://github.com/cfregly/ai-performance-engineering/
O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
YouTube: https://www.youtube.com/@AIPerformanceEngineering
Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm
- GPU, CUDA, and PyTorch Performance Optimizations
Zoom link: https://us02web.zoom.us/j/82308186562
Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth

Talk #1: GPU, PyTorch, and CUDA Performance Optimizations
Talk #2: GPU, PyTorch, and CUDA Performance Optimizations
Zoom link: https://us02web.zoom.us/j/82308186562
Related Links
GitHub Repo: http://github.com/cfregly/ai-performance-engineering/
O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
YouTube: https://www.youtube.com/@AIPerformanceEngineering
Generative AI Free Course on DeepLearning.ai: https://bit.ly/gllm