Measuring P95/P99 Latency with Prometheus and SLOs
Details
Latency is one of the most important indicators of user experience and system reliability in distributed applications. Average response times often appear healthy, but tail latency (P95 and P99, the response times that the slowest 5% and 1% of requests exceed) frequently reveals the slow requests that users actually experience under real workloads.
This session explores practical techniques for measuring, interpreting, and improving latency in cloud-native systems using Prometheus and Service Level Objectives (SLOs).
Topics include:
- Understanding P50, P95, and P99 latency
- Why averages hide performance problems
- Measuring latency with Prometheus histograms
- Writing useful PromQL queries
- Designing meaningful SLOs
- Using latency metrics to guide engineering decisions
- Common performance anti-patterns in Kubernetes environments
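As a small illustration of why averages hide performance problems, the sketch below compares the mean with nearest-rank percentiles on a made-up latency sample. The sample values and the helper name `percentile` are assumptions chosen for illustration, not material from the session, and the nearest-rank method shown here is not how Prometheus estimates quantiles from histogram buckets.

```python
# Hypothetical latency sample in milliseconds: most requests are fast,
# a small fraction are slow. The numbers are invented for illustration.
latencies_ms = [100] * 90 + [800] * 8 + [3000] * 2


def percentile(samples, p):
    """Return the p-th percentile (p is an integer 1..100) using the
    nearest-rank method: the smallest value such that at least p% of
    the samples are less than or equal to it."""
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms  P50={p50_ms} ms  P95={p95_ms} ms  P99={p99_ms} ms")
```

In this sample the mean sits close to the typical request, while P95 and P99 expose the slow requests that the mean smooths over.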
The session is intended for software engineers, platform engineers, SREs, architects, and anyone interested in improving the reliability and performance of cloud-native applications.
The presentation is based on practical engineering experience, reproducible examples, and real-world performance analysis.
