PyTorch Data Loader Tuning + GPU Cross-Architecture Optimizations: CUDA and AMD
Details
Zoom link: https://us02web.zoom.us/j/82308186562
Talk #0: Introductions and Meetup Updates
by Chris Fregly and Antje Barth
Talk #1: Solving Bottlenecks on the Data Input Pipeline with PyTorch Profiler and TensorBoard
by Chaim Rand, et al.
Based on this Medium post: https://medium.com/data-science/solving-bottlenecks-on-the-data-input-pipeline-with-pytorch-profiler-and-tensorboard-5dced134dbe9
Talk #2: How to Write Cross-Architecture Kernels: NVIDIA CUDA and AMD ROCm (a.k.a. "CUDA for AMD")
by Quentin Anthony, Cross-Platform Kernel Engineer @ Zyphra
New models such as DeepSeek-R1 and Llama-4 are being deployed across both AMD and NVIDIA GPUs, but how are cross-hardware kernels written? In this talk, we'll discuss considerations such as kernel sizing and cross-architecture optimization when writing kernels for different SIMD hardware.
Related Links
GitHub Repo: http://github.com/cfregly/ai-performance-engineering/
O'Reilly Book: https://www.amazon.com/Systems-Performance-Engineering-Optimizing-Algorithms/dp/B0F47689K8/
YouTube: https://www.youtube.com/@AIPerformanceEngineering
Free Generative AI Course on DeepLearning.AI: https://bit.ly/gllm
