Name: PyTorch Data Loader Tuning + GPU Cross-Architecture Optimizations: CUDA and AMD
Start: 2025-05-19T09:00:00-07:00
End: 2025-05-19T10:00:00-07:00

**Zoom link**: [https://us02web.zoom.us/j/82308186562](https://us02web.zoom.us/j/82308186562)

**Talk #0: Introductions and Meetup Updates**
by Chris Fregly and Antje Barth

**Talk #1: Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard**
by Chaim Rand, et al.

Based on this Medium post: https://medium.com/data-science/solving-bottlenecks-on-the-data-input-pipeline-with-pytorch-profiler-and-tensorboard-5dced134dbe9

**Talk #2: How to Write Cross-Architecture Kernels: NVIDIA CUDA and AMD ROCm (a.k.a. "CUDA for AMD")**
by Quentin Anthony, Cross-Platform Kernel Engineer @ Zyphra

New models such as DeepSeek-R1 and Llama-4 are being deployed across both AMD and NVIDIA GPUs, but how are cross-hardware kernels written? In this talk, we'll discuss considerations such as kernel sizing and cross-architecture optimization when writing kernels for different SIMD hardware.

**Related Links**
GitHub Repo: [http://github.com/cfregly/ai-performance-engineering/](http://github.com/cfregly/ai-performance-engineering/)
O'Reilly Book: [https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/](https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/)
YouTube: [https://www.youtube.com/@AIPerformanceEngineering](https://www.youtube.com/@AIPerformanceEngineering)
Generative AI Free Course on DeepLearning.ai: [https://bit.ly/gllm](https://bit.ly/gllm)

AI Performance Engineering Meetup (San Francisco, Global)

Online event