About us

Cloud native technologies empower organizations to build and run scalable applications in modern, dynamic environments such as public, private, and hybrid clouds. Containers, service meshes, microservices, immutable infrastructure, and declarative APIs exemplify this approach.
These techniques enable loosely coupled systems that are resilient, manageable, and observable. Combined with robust automation, they allow engineers to make high-impact changes frequently and predictably with minimal toil.

Upcoming events

  • Catch your GPUs lying or Why the cloud bill just tripled?

    Online

    Event link: https://meet.google.com/vra-rvsk-kde

    Grafana dashboards often show P4d instances humming at 95% "GPU Busy," yet inference tail latency keeps spiking and training epochs crawl.
    Standard metrics routinely misrepresent reality: "GPU Busy" only indicates that the multiprocessor is occupied; it does not mean useful work is being done.
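
To make the gap concrete, here is a toy sketch in Python. It assumes a utilization counter that marks a whole sampling period as busy whenever any kernel ran during it; that model, the function names, and the trace values are illustrative assumptions, not a description of how any particular tool computes "GPU Busy."

```python
from typing import List, Tuple

# (start_ms, end_ms) of one kernel execution; hypothetical trace data.
Interval = Tuple[int, int]


def busy_percent(kernels: List[Interval], window_ms: int, period_ms: int) -> float:
    """Share of sampling periods in which at least one kernel was running."""
    n_periods = window_ms // period_ms
    busy = 0
    for i in range(n_periods):
        lo, hi = i * period_ms, (i + 1) * period_ms
        # A period counts as busy if any kernel overlaps it at all.
        if any(start < hi and end > lo for start, end in kernels):
            busy += 1
    return 100.0 * busy / n_periods


def active_percent(kernels: List[Interval], window_ms: int) -> float:
    """Share of wall-clock time spent executing kernels (assumes no overlap)."""
    return 100.0 * sum(end - start for start, end in kernels) / window_ms


# One 10 ms kernel at the start of every 100 ms period, over a 1 s window.
trace = [(i * 100, i * 100 + 10) for i in range(10)]
print(busy_percent(trace, window_ms=1000, period_ms=100))
print(active_percent(trace, window_ms=1000))
```

Under these assumptions, the sampled counter reports every period as busy even though kernels run for only a tenth of the window. The rest of the time the GPU is waiting on something else.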

    When AI infrastructure hangs, standard telemetry tools (like DCGM, Prometheus, or CloudWatch) hit a blind spot. They flag a stalled node but cannot identify the underlying OS-level friction causing it.
    The result is weeks of manual debugging, stranded capacity, and inflated cloud bills.

    This session opens up the black box of AI infrastructure monitoring, focusing on GPU observability and GPU management for AI workloads. We will examine the gap between the application, the Linux kernel, and the CUDA runtime, and demonstrate how to bridge it using Ingero, a new open-source, eBPF-based GPU observability agent. You will hear first-hand from the co-authors and maintainers of the Ingero FOSS repo.

    The discussion will track a GPU cluster hang through the full incident lifecycle:

    • The SRE View (Detection): Aggregating OTLP telemetry to monitor top-down cluster health, setting actionable alerts for "Stranded Capacity" rather than generic timeouts.
    • The Engineering View (Investigation): Using eBPF with <2% overhead to trace CUDA workloads and pinpoint the exact host-side network stall or CPU context switch starving the GPU.
    • The FinOps View (Cost Visibility): Translating kernel-level metrics into dollars by mapping idle GPU cycles back to specific processes or apps, Kubernetes namespaces, or Slurm job IDs to generate accurate waste and usage reports.
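
The cost-attribution idea in the FinOps view can be sketched in a few lines of Python. This is a minimal illustration, not Ingero's implementation: the `GpuSample` record, the `waste_report` function, the owner labels, and the hourly rate are all hypothetical, and it assumes per-owner allocated and active GPU seconds are already available.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class GpuSample:
    owner: str          # hypothetical label: namespace, Slurm job ID, or process name
    allocated_s: float  # seconds the GPU was reserved for this owner
    active_s: float     # seconds the GPU was doing work for this owner


def waste_report(samples: Iterable[GpuSample],
                 hourly_rate_usd: float) -> Dict[str, Dict[str, float]]:
    """Aggregate allocated and idle GPU time per owner and price it."""
    totals = defaultdict(lambda: {"allocated_s": 0.0, "idle_s": 0.0})
    for s in samples:
        # Idle time is whatever was reserved but not used; never negative.
        idle_s = max(s.allocated_s - s.active_s, 0.0)
        totals[s.owner]["allocated_s"] += s.allocated_s
        totals[s.owner]["idle_s"] += idle_s

    return {
        owner: {
            "spend_usd": t["allocated_s"] / 3600 * hourly_rate_usd,
            "waste_usd": t["idle_s"] / 3600 * hourly_rate_usd,
        }
        for owner, t in totals.items()
    }


# Made-up samples and an assumed rate of 4 USD per GPU-hour.
samples = [
    GpuSample("ns-a", allocated_s=3600, active_s=900),
    GpuSample("ns-a", allocated_s=3600, active_s=2700),
    GpuSample("job-42", allocated_s=7200, active_s=7200),
]
for owner, row in waste_report(samples, hourly_rate_usd=4.0).items():
    print(owner, row)
```

Grouping by a single owner string keeps the sketch small. A real report would likely key on several dimensions at once (process, namespace, job) and use per-instance pricing.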

    2 attendees

Members

2,012