Northwest Arkansas Cloud-Native Engineering Meetup – First Session
Details
Modern microservice systems often behave well in development but fail unexpectedly under production load. One common failure pattern is the retry storm, where automated retries multiply traffic and overwhelm downstream services. In Kubernetes-based platforms, a single timeout can trigger waves of retries. As retries increase load, autoscaling adds more concurrent callers, and tail latency degrades sharply before traditional CPU or memory alerts fire.
This session explores practical techniques for detecting and preventing retry storms using cloud-native observability and reliability controls.
We will cover:
• Why retry storms occur in distributed microservices
• Detecting retry amplification using distributed tracing
• Observability patterns with OpenTelemetry and Prometheus
• Correct retry strategies, backoff, and jitter
• Reliability guardrails for Kubernetes workloads
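As a small preview of the backoff and jitter topic, here is a minimal sketch of capped exponential backoff with "full jitter" (each wait is drawn uniformly between zero and an exponentially growing ceiling). The function name `backoff_delays` and the `base`, `cap`, and `rng` parameters are illustrative assumptions, not material taken from the talk.

```python
import random


def backoff_delays(max_retries, base=0.1, cap=5.0, rng=None):
    """Return one sleep duration (in seconds) per retry attempt.

    Hypothetical helper: the ceiling doubles on each attempt, is capped
    at `cap`, and the actual delay is sampled uniformly below the ceiling
    so that many clients do not retry at the same instant.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    # Print the waits a single client would use across five retries.
    for i, delay in enumerate(backoff_delays(5)):
        print(f"retry {i + 1}: wait {delay:.3f}s")
```

Randomizing the whole delay, rather than adding a small offset to a fixed schedule, spreads retries from many clients over time, which is what keeps one timeout from turning into synchronized waves of traffic.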
Agenda
11:00 AM – Welcome and introductions
11:10 AM – Talk: Preventing Retry Storms in Kubernetes Microservices
11:50 AM – Architecture discussion and Q&A
11:20 AM – Open discussion: debugging production microservices
12:00 PM – Wrap-up
Who should attend
• Software engineers
• DevOps engineers
• Platform engineers
• Site reliability engineers (SREs)
• Architects
• Students interested in cloud-native systems
If you build or operate Kubernetes platforms, microservices architectures, or distributed systems, this session will provide practical techniques for understanding and preventing cascading failures in production environments.
