AI-Augmented DevOps: From Incident Detection to Recovery
# Description
Modern DevOps has automated software delivery, but operational complexity continues to grow. Engineers still spend significant time debugging deployments, correlating logs, managing incidents, and recovering production systems.
This session explores how AI is actively reshaping DevOps workflows beyond code generation and into operational assistance. Through a live demo using a production-grade Kubernetes cluster on KinD, attendees will follow a real operational scenario: deploying a microservices application, triggering a production-style payment-service cascade failure caused by database connection exhaustion, analyzing live pod logs and operational signals, and using Claude AI to investigate root cause and drive recovery actions.
Rather than focusing solely on AI-generated code, this talk highlights a broader shift: AI participating in operational reasoning and helping engineers shorten the path from detection through diagnosis to remediation. The central message is clear: AI is not replacing DevOps engineers; it is reducing the cognitive load of operations.
# Outline
- The Ops Problem Nobody Talks About
- Why DevOps automation solved delivery but not operational understanding
- The detection-to-diagnosis gap: why 2am incidents take 40 minutes to understand
- Honest framing: AI as augmentation, not autonomy
- The workflow pattern: Detection → Diagnosis → Decision → Execution → Verification
- The System at Rest — Healthy State
- Live dashboard showing 8 microservices, all healthy, flat charts
- Real metrics flowing from the KinD cluster via a metrics-collector sidecar
- Brief mention of Chaos Mesh as a production chaos engineering discipline
- Live Demo — Incident Injection and Investigation
- Act 1: Triggering the Failure
- Click “Simulate Incident” on the dashboard — triggers a real k6 load generator Job in Kubernetes
- 50 concurrent users hammer payment-service, exhausting the Postgres connection pool (max 5 connections)
- Watch real pod restart counts climb in kubectl get pods -w alongside dashboard degradation
- Act 2: AI-Assisted Investigation
- INC-2847 fires: payment-service CRITICAL, 1,842 users affected, checkout returning 503s
- Navigate to incident page — real pod logs stream live (LIVE badge visible)
- Click “Analyze with AI” — Claude streams root cause analysis in under 5 seconds from real cluster logs
- Key moment: the engineer reads the analysis, questions it, decides — AI compresses investigation, human makes the call
- Follow-up prompt: “what happens if we scale pods before increasing DB connection limit?” — Claude reasons about ordering dependencies
- Act 3: Recovery
- Generate Recovery Plan — 5-step human-driven checklist, not an auto-execute button
- Run ./scripts/recover.sh in terminal — kubectl scale deployment payment-service --replicas=6
- Manually tick each checklist item as steps are executed — human decision visible at every step
- Dashboard returns to HEALTHY — INC-2847 moves to RESOLVED
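The Act 1 failure mode can be reproduced in miniature without a cluster. The sketch below is illustrative only and is not the demo's code: it models the connection pool as an asyncio semaphore with 5 slots and sends 50 concurrent requests, the numbers given in the outline. The function name `simulate`, the 0.1 s query time, and the 0.15 s acquire timeout are assumptions chosen for the example.

```python
import asyncio


async def simulate(users: int = 50, pool_size: int = 5,
                   hold_s: float = 0.1, timeout_s: float = 0.15) -> tuple[int, int]:
    """Return (served, rejected) for `users` concurrent requests against a fixed pool."""
    pool = asyncio.Semaphore(pool_size)  # stands in for the DB connection pool

    async def handle_request() -> bool:
        try:
            # Wait for a free connection, but give up after timeout_s (assumed value).
            await asyncio.wait_for(pool.acquire(), timeout=timeout_s)
        except asyncio.TimeoutError:
            return False  # a real service would answer 503 at this point
        try:
            await asyncio.sleep(hold_s)  # pretend to run a query
            return True
        finally:
            pool.release()  # always hand the connection back

    results = await asyncio.gather(*(handle_request() for _ in range(users)))
    served = sum(results)
    return served, users - served


if __name__ == "__main__":
    served, rejected = asyncio.run(simulate())
    print(f"served={served} rejected={rejected}")
```

With these assumed timings a large share of the requests give up while waiting for a connection, which is the same shape of failure the demo surfaces as 503s on checkout.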
- Patterns, Limits and What’s Next
- The workflow pattern to take home: Detection → Diagnosis → Decision → Execution → Verification
- Where AI genuinely helps in ops today: log summarisation, signal correlation, blast radius estimation, runbook generation
- Where AI still fails: novel failure modes, ambiguous causality, institutional knowledge, stateful systems
- Honest prediction: detection-to-diagnosis gap shrinks from 40 minutes to 2 minutes for 80% of incidents within 2 years
- Not because AI replaces SREs — because AI becomes the first responder that hands a structured brief to the human
- Q&A
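The ordering question raised in Act 2 comes down to simple arithmetic: scaling out multiplies the number of connections the service can request. A minimal sketch follows, assuming each replica keeps its own pool of 5 connections and all replicas share one database-side limit. The `ConnectionBudget` name and the database limit of 20 are hypothetical; the pool size of 5 and the 6 replicas come from the outline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionBudget:
    """Hypothetical helper: compares connection demand with a database-side limit."""
    replicas: int
    pool_size_per_replica: int
    db_connection_limit: int

    @property
    def peak_demand(self) -> int:
        # Worst case: every replica fills its own pool at the same time.
        return self.replicas * self.pool_size_per_replica

    @property
    def headroom(self) -> int:
        # Negative headroom means some replicas cannot get a connection.
        return self.db_connection_limit - self.peak_demand

    def is_safe(self) -> bool:
        return self.headroom >= 0


if __name__ == "__main__":
    # Scale first, with the (assumed) database limit left unchanged.
    scaled_first = ConnectionBudget(replicas=6, pool_size_per_replica=5,
                                    db_connection_limit=20)
    print(scaled_first.peak_demand, scaled_first.is_safe())

    # Raise the (assumed) database limit first, then scale.
    limit_first = ConnectionBudget(replicas=6, pool_size_per_replica=5,
                                   db_connection_limit=30)
    print(limit_first.peak_demand, limit_first.is_safe())
```

Under these assumptions, scaling to 6 replicas before raising the limit leaves demand above what the database accepts, so the new pods would compete for the same scarce connections instead of relieving the pressure.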
# Additional Notes
- Target Audience
- DevOps Engineers and SREs
- Cloud Architects and Platform Engineers
- Engineering Leaders wanting a practical, realistic perspective on AI in operations
- The talk will include live demonstrations of AI tools and techniques.
- Attendees should have some familiarity with existing AI libraries and tools, or have worked on an AI project.
