Details

Modern microservice systems often behave well in development but fail unexpectedly under production load. One common failure pattern is the retry storm, in which automated retries multiply traffic and overwhelm downstream services. In Kubernetes-based platforms, a single timeout can trigger waves of retries; as retries add load, autoscaling introduces more concurrency, and tail latency degrades sharply before traditional CPU or memory alerts fire.
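To get a feel for how quickly this amplification compounds, consider a call chain several services deep where each layer retries failed calls. A minimal sketch (the function name and the numbers are illustrative, not from any specific system):

```python
# Worst-case retry amplification in a layered call chain:
# each layer issues 1 original attempt plus up to R retries, so one
# client request can fan out into (1 + R) ** depth downstream calls.
def worst_case_amplification(retries_per_call: int, depth: int) -> int:
    return (1 + retries_per_call) ** depth

# With 3 retries per call and a 4-service-deep chain, a single
# client request can become 256 requests at the deepest service.
print(worst_case_amplification(3, 4))  # → 256
```

This is why retry storms surface as sudden, nonlinear traffic spikes rather than gradual load growth.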

This session explores practical techniques for detecting and preventing retry storms using cloud-native observability and reliability controls.

We will cover:
• Why retry storms occur in distributed microservices
• Detecting retry amplification using distributed tracing
• Observability patterns with OpenTelemetry and Prometheus
• Correct retry strategies, backoff, and jitter
• Reliability guardrails for Kubernetes workloads
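As a preview of the backoff-and-jitter topic, here is a minimal sketch of capped exponential backoff with full jitter; the function name, parameters, and defaults are illustrative, not a specific library's API:

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.1, max_delay=2.0):
    """Retry `call` with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retries exhausted; surface the failure
            # Full jitter: sleep a random duration in [0, capped exponential],
            # so synchronized clients don't retry in lockstep waves.
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))
```

The jitter is the part that prevents storms: without it, every client that timed out at the same moment retries at the same moment, recreating the original spike.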

Agenda
11:00 AM — Welcome and introductions
11:10 AM — Talk: Preventing Retry Storms in Kubernetes Microservices
11:20 AM — Open discussion: debugging production microservices
11:50 AM — Architecture discussion and Q&A
12:00 PM — Wrap-up

Who should attend
• Software engineers
• DevOps engineers
• Platform engineers
• Site reliability engineers (SREs)
• Architects
• Students interested in cloud-native systems

If you build or operate Kubernetes platforms, microservices architectures, or distributed systems, this session will provide practical techniques for understanding and preventing cascading failures in production environments.

Related topics

Distributed Systems
Cloud Native
Kubernetes
Observability
