Catch Your GPUs Lying, or Why the Cloud Bill Just Tripled
"GPU Busy" is a lie. It simply tells you a multiprocessor is occupied - not that useful work is actually being done.
Grafana dashboards often show P4d instances humming at 95% "GPU Busy," yet inference tail latency keeps spiking and training epochs crawl. When AI infrastructure hangs, standard telemetry tools like Datadog, Prometheus, or CloudWatch hit a blind spot: they can flag a stalled node, but they cannot identify the underlying OS-level friction causing it. The result is weeks of manual debugging, stranded capacity, and inflated cloud bills.
Join us for a live teardown of the AI infrastructure black box. We will examine the gap between your application, the Linux kernel, and the CUDA runtime - and show you how to bridge it.
You will get first-hand insights from the co-authors and maintainers of Ingero, a new open-source, eBPF-based GPU observability agent, as we track a massive GPU cluster hang through its full lifecycle:
- Detection (The SRE View): Aggregating OTLP telemetry to monitor top-down cluster health and setting actionable alerts for "Stranded Capacity" instead of generic timeouts.
- Investigation (The Engineering View): Using eBPF with <2% overhead to trace CUDA workloads natively. We will pinpoint the exact host-side network stall or CPU context switch starving the GPU.
- Resolution (The FinOps View): Translating kernel-level metrics into dollars. Learn how to map idle GPU cycles back to specific K8s namespaces or Slurm Job IDs to generate exact waste reports.
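
To make the FinOps step concrete, here is a minimal sketch of the arithmetic behind a waste report. It is not Ingero's implementation or output format: the record layout, the `waste_report` helper, the namespace names, and the $4.00 per GPU-hour price are all hypothetical, chosen only to illustrate how idle GPU time could be rolled up per namespace and priced.

```python
from collections import defaultdict

# Hypothetical records: (namespace, GPU-seconds allocated, GPU-seconds of useful work).
# In practice these would come from whatever telemetry pipeline you run.
SAMPLES = [
    ("team-a", 3600.0, 900.0),
    ("team-a", 3600.0, 1800.0),
    ("team-b", 7200.0, 7200.0),
]


def waste_report(samples, price_per_gpu_hour):
    """Sum idle GPU-seconds per namespace and convert them to dollars."""
    idle_seconds = defaultdict(float)
    for namespace, allocated_s, active_s in samples:
        # Idle time is what was paid for but not used; clamp at zero
        # in case of measurement jitter.
        idle_seconds[namespace] += max(allocated_s - active_s, 0.0)
    return {
        namespace: round(idle / 3600.0 * price_per_gpu_hour, 2)
        for namespace, idle in idle_seconds.items()
    }


# Assumed, illustrative price; substitute your actual per-GPU-hour rate.
report = waste_report(SAMPLES, price_per_gpu_hour=4.0)
for namespace, dollars in sorted(report.items()):
    print(f"{namespace}: ${dollars:.2f} wasted")
```

The same roll-up would work keyed on Slurm Job IDs instead of namespaces; only the grouping key changes.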
