AI-Augmented DevOps: From Incident Detection to Recovery

Name: AI-Augmented DevOps: From Incident Detection to Recovery
Start: 2026-07-18T10:30:00+05:30
End: 2026-07-18T12:30:00+05:30
Location: Equal Experts India Pvt Ltd

Hosted by Tanay P. and 4 others

ExpertTalks Bengaluru

Details

# Description

Modern DevOps has automated software delivery, but operational complexity continues to grow. Engineers still spend significant time debugging deployments, correlating logs, managing incidents, and recovering production systems.
This session explores how AI is actively reshaping DevOps workflows beyond code generation and into operational assistance. Through a live demo using a production-grade Kubernetes cluster on KinD, attendees will follow a real operational scenario: deploying a microservices application, triggering a production-style payment-service cascade failure caused by database connection exhaustion, analyzing live pod logs and operational signals, and using Claude AI to investigate root cause and drive recovery actions.
Rather than focusing solely on AI-generated code, this talk highlights a broader shift: AI participating in operational reasoning and helping engineers shorten the path from detection through diagnosis to remediation. The central message is clear: AI is not replacing DevOps engineers. It is reducing their operational cognitive load.

# Outline

The Ops Problem Nobody Talks About

Why DevOps automation solved delivery but not operational understanding
The detection-to-diagnosis gap: why 2 a.m. incidents take 40 minutes to understand
Honest framing: AI as augmentation, not autonomy
The workflow pattern: Detection → Diagnosis → Decision → Execution → Verification

The System at Rest — Healthy State

Live dashboard showing 8 microservices, all healthy, flat charts
Real metrics flowing from KinD cluster via metrics-collector sidecar
Brief mention of Chaos Mesh as a production chaos engineering discipline

Live Demo — Incident Injection and Investigation

Act 1: Triggering the Failure
Click “Simulate Incident” on the dashboard — triggers a real k6 load generator Job in Kubernetes
50 concurrent users hammer payment-service, exhausting the Postgres connection pool (max 5 connections)
Watch real pod restart counts climb in kubectl get pods -w alongside dashboard degradation
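
The failure mode in Act 1 can be approximated without a cluster. The following is a minimal sketch, not the demo's actual code: it assumes all 50 requests arrive while every connection is still held, and that a request which cannot get a connection fails immediately with a 503. The names `ConnectionPool` and `simulate_burst` are hypothetical.

```python
import threading


class ConnectionPool:
    """Toy stand-in for a database connection pool with a hard cap."""

    def __init__(self, max_connections):
        self._slots = threading.BoundedSemaphore(max_connections)

    def try_acquire(self):
        # Non-blocking: returns False at once when the pool is exhausted.
        return self._slots.acquire(blocking=False)

    def release(self):
        self._slots.release()


def simulate_burst(pool, concurrent_users):
    """Model a burst in which every request needs a connection at the same time."""
    served = 0
    rejected = 0
    for _ in range(concurrent_users):
        if pool.try_acquire():
            served += 1      # connection is held for the duration of the burst
        else:
            rejected += 1    # assumption: surfaced to the caller as HTTP 503
    for _ in range(served):
        pool.release()       # burst is over, connections return to the pool
    return {"served": served, "rejected_503": rejected}


result = simulate_burst(ConnectionPool(max_connections=5), concurrent_users=50)
print(result)  # {'served': 5, 'rejected_503': 45}
```

With a pool of 5 and 50 simultaneous users, 45 requests find no free connection. In a real service those requests would typically queue and time out rather than fail instantly, which is what turns pool exhaustion into a cascade.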
Act 2: AI-Assisted Investigation
INC-2847 fires: payment-service CRITICAL, 1,842 users affected, checkout returning 503s
Navigate to incident page — real pod logs stream live (LIVE badge visible)
Click “Analyze with AI” — Claude streams root cause analysis in under 5 seconds from real cluster logs
Key moment: the engineer reads the analysis, questions it, decides — AI compresses investigation, human makes the call
Follow-up prompt: “what happens if we scale pods before increasing DB connection limit?” — Claude reasons about ordering dependencies
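
The ordering question in the follow-up prompt comes down to simple arithmetic, sketched below under stated assumptions: each pod keeps its own pool of 5 connections (the outline gives the pool size but not whether it is per pod), and the database-side limits of 20 and 40 are hypothetical values chosen for illustration. The function names are also hypothetical.

```python
def connection_demand(replicas, pool_size_per_pod):
    """Worst-case connections the deployment can open against the database."""
    return replicas * pool_size_per_pod


def safe_to_scale(replicas, pool_size_per_pod, db_max_connections, reserved=0):
    """True if the database can absorb every pod filling its pool.

    `reserved` models connections kept back for admin or other services.
    """
    usable = db_max_connections - reserved
    return connection_demand(replicas, pool_size_per_pod) <= usable


# 6 replicas (the target used in the recovery step) with 5 connections each
print(connection_demand(6, 5))                        # 30
print(safe_to_scale(6, 5, db_max_connections=20))     # False: scaling first makes things worse
print(safe_to_scale(6, 5, db_max_connections=40))     # True: raise the limit, then scale
```

If the database limit is not raised first, adding replicas only multiplies the number of clients competing for the same connections.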
Act 3: Recovery
Generate Recovery Plan — 5-step human-driven checklist, not an auto-execute button
Run ./scripts/recover.sh in terminal — kubectl scale deployment payment-service --replicas=6
Manually tick each checklist item as steps are executed — human decision visible at every step
Dashboard returns to HEALTHY — INC-2847 moves to RESOLVED
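
The human-in-the-loop idea behind Act 3 can be illustrated with a small sketch. It is not the demo's implementation: the class name, the rule that steps must be confirmed in order, and the placeholder step labels are all assumptions, since the outline does not list the five steps.

```python
class RecoveryChecklist:
    """Checklist that only advances when a named human confirms each step."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.confirmed_by = [None] * len(self.steps)

    def tick(self, index, engineer):
        if not engineer:
            raise ValueError("each step must be confirmed by a named engineer")
        if index > 0 and self.confirmed_by[index - 1] is None:
            raise RuntimeError("previous step has not been confirmed yet")
        self.confirmed_by[index] = engineer

    @property
    def status(self):
        # The incident only resolves once every step has a human sign-off.
        if all(name is not None for name in self.confirmed_by):
            return "RESOLVED"
        return "OPEN"


# Placeholder labels: the real five steps are not given in the outline.
checklist = RecoveryChecklist([f"step {n}" for n in range(1, 6)])
for i in range(5):
    checklist.tick(i, engineer="on-call")
print(checklist.status)  # RESOLVED
```

Nothing in the sketch executes a recovery action; it only records that a person did, which mirrors the outline's point that the plan is a checklist rather than an auto-execute button.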

Patterns, Limits and What’s Next

The workflow pattern to take home: Detection → Diagnosis → Decision → Execution → Verification
Where AI genuinely helps in ops today: log summarisation, signal correlation, blast radius estimation, runbook generation
Where AI still fails: novel failure modes, ambiguous causality, institutional knowledge, stateful systems
Honest prediction: detection-to-diagnosis gap shrinks from 40 minutes to 2 minutes for 80% of incidents within 2 years
Not because AI replaces SREs — because AI becomes the first responder that hands a structured brief to the human

# Additional Notes

Target Audience
DevOps Engineers and SREs
Cloud Architects and Platform Engineers
Engineering Leaders wanting a practical, realistic perspective on AI in operations
The talk will include live demonstrations of AI tools and techniques.
Attendees should have some knowledge of existing AI libraries and tools, or have worked on an AI project.
