About us
If you are a cloud architect, a DevOps engineer, or simply interested in the latest cloud trends, this is the place for you!
In this group, we will review all the hot trends and new features and tools in the cloud and DevOps world, focusing on multi-cloud.
Please join us!
Upcoming events (1)

Catch your GPUs lying or Why the cloud bill just tripled?
Online
"GPU Busy" is a lie. It simply tells you that a multiprocessor is occupied, not that useful work is actually being done.
Grafana dashboards often show P4d instances humming at 95% "GPU Busy," yet inference tail-latency continues to spike and training epochs crawl. When AI infrastructure hangs, standard telemetry tools like Datadog, Prometheus, or CloudWatch hit a blind spot: they can flag a stalled node, but they cannot identify the underlying OS-level friction causing it. The result is weeks of manual debugging, stranded capacity, and inflated cloud bills.
Join us for a live teardown of the AI infrastructure black box. We will examine the gap between your application, the Linux kernel, and the CUDA runtime, and show you how to bridge it.
You will get first-hand insights from the co-authors and maintainers of Ingero, a new open-source, eBPF-based GPU observability agent, as we track a massive GPU cluster hang through its full lifecycle:
- 🚨 Detection (The SRE View): Aggregating OTLP telemetry to monitor top-down cluster health and setting actionable alerts for "Stranded Capacity" instead of generic timeouts.
- 🔍 Investigation (The Engineering View): Using eBPF with <2% overhead to trace CUDA workloads natively. We will pinpoint the exact host-side network stall or CPU context switch starving the GPU.
- 💸 Resolution (The FinOps View): Translating kernel-level metrics into dollars. Learn how to map idle GPU cycles back to specific K8s namespaces or Slurm Job IDs to generate exact waste reports.
2 attendees
Past events (27)