
Details

Modern high-performance applications increasingly rely on CPU-GPU heterogeneous systems, yet their performance is often limited by poorly placed global synchronization barriers, synchronous memory transfers, and default-stream CUDA semantics that prevent overlap between computation and communication. This talk presents compiler-driven techniques to systematically remove these bottlenecks. I will first introduce hetero-sync motion, which safely relocates barrier instructions to enable greater CPU-GPU concurrency, and sync2async, which automatically transforms synchronous data transfers and kernel launches on the default stream into non-default-stream asynchronous calls with correct stream allocation and synchronization. Both techniques rely on precise, context-sensitive, flow-sensitive inter-procedural data-flow analyses implemented in LLVM/Clang, and deliver significant speedups on modern GPUs.
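To make the sync2async idea concrete, here is a hand-written sketch of the kind of rewrite the talk describes, assuming ordinary CUDA runtime calls; the function and kernel names are illustrative, not the tool's actual output, and true copy/compute overlap additionally requires pinned host memory.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

// Before: synchronous copy plus a default-stream launch. The CPU blocks
// on the copy, and the default stream serializes with other streams.
void before(float *h, float *d, int n) {
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
}

// After: a sync2async-style rewrite. A dedicated non-default stream and
// asynchronous calls let independent CPU work overlap with the transfer
// and kernel; synchronization happens only where the result is needed.
void after(float *h, float *d, int n) {
    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, s);
    scale<<<(n + 255) / 256, 256, 0, s>>>(d, 2.0f, n);
    // ... independent CPU work can execute here ...
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
}
```

The compiler's job, per the abstract, is to prove via inter-procedural data-flow analysis that the intervening CPU work does not touch the transferred data, so the deferred synchronization is safe.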
I will briefly touch upon our ongoing work on optimizing Unified Memory programs using static analysis to reduce unnecessary on-demand page migrations. Together, these efforts show how compiler analysis can unlock concurrency and efficiency in heterogeneous systems without increasing programmer burden.
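As background for the Unified Memory work, the sketch below (assumed illustration, not the authors' tool output) shows the runtime mechanism such an analysis could exploit: prefetching managed pages to the GPU before a kernel runs, instead of paying for on-demand page migrations at first touch.

```cuda
#include <cuda_runtime.h>

__global__ void inc(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *p;
    cudaMallocManaged(&p, n * sizeof(float));   // unified (managed) memory
    for (int i = 0; i < n; ++i) p[i] = 0.0f;    // CPU writes: pages reside on host

    // Without this call, the kernel's first accesses fault and migrate
    // pages on demand; an explicit prefetch moves them ahead of time.
    int dev;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(p, n * sizeof(float), dev, 0);

    inc<<<(n + 255) / 256, 256>>>(p, n);
    cudaDeviceSynchronize();
    cudaFree(p);
    return 0;
}
```

A static analysis like the one described would decide where such prefetches are profitable and safe, so the programmer never inserts them by hand.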

Note:
1. This meetup is designed to be Asia time zone friendly.
2. This meetup will be recorded.
3. The event is open to participants of all genders.

Related topics

Women's Social
Concurrent Programming
Compilers
