Google-Scale Production Systems: Building Resilience & Velocity at Scale

Name: Google-Scale Production Systems: Building Resilience & Velocity at Scale
Start: 2026-04-19T19:00:00+05:30
End: 2026-04-19T22:00:00+05:30

Hosted by venkatesh D.

CoderRange - AI , Big data , Data Science !.

Details

🕒 3-Hour Agenda

| Segment | Duration | Description |
| ------- | -------- | ----------- |
| 1. Intro & Context | 15 min | Overview of Google-scale production challenges—scale, complexity, and impact |
| 2. SRE Fundamentals | 30 min | Error budgets, SLIs/SLOs, and reliability culture ([cloud.google.com](https://cloud.google.com/sre/?utm_source=chatgpt.com)) |
| 3. CI/CD at Scale | 30 min | Google's Rapid release automation and Blaze build system, automated testing, safe rollouts |
| 4. Canary Deployments & Metrics | 35 min | Selecting effective SLIs for canaries, gradual rollouts |
| ☕ Break | 10 min | — |
| 5. Resilience Engineering | 30 min | Chaos engineering (DiRT), failure drills, playbook-driven incident response |
| 6. Observability & Monitoring | 30 min | Monitoring tiers, alert testing, instrumenting SLIs, dashboards |
| 7. Icebreaker Lab: Build a “Mini SRE Flow” | 25 min | Design a simplified CI-canary-monitor-playbook system in breakout groups |
| 8. Wrap-Up & Q&A | 20 min | Share key tools, patterns, and next steps |

## 🔍 Important Session Highlights

### ✅ 1. SRE Culture & Reliability

Understand reliability as something governed by SLIs, SLOs, and error budgets, rather than by an unrealistic pursuit of perfection
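
As a rough illustration of the error-budget idea, the sketch below turns an availability SLO into an allowed amount of downtime and tracks how much of a request-based budget is left. The function names, the 30-day window, and the sample numbers are assumptions made for this example, not material from the session.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent.

    A negative result means the budget is overspent.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # Hypothetical service with a 99.9% availability SLO.
    print(f"{error_budget_minutes(0.999):.1f} minutes of downtime allowed per 30 days")
    print(f"{budget_remaining(0.999, 1_000_000, 250):.0%} of the error budget remains")
```

A team might pause risky releases once the remaining budget approaches zero and resume when the window rolls over.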

### ✅ 2. CI/CD Infrastructure at Google

Learn how Rapid and Blaze enable thousands of concurrent builds, tests, and deployments through repeatable, automated release processes

### ✅ 3. Canary Deployment Best Practices

Selecting effective canary SLIs (error rates, latency, resource usage) and avoiding common pitfalls in rollout strategies
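
One way to picture a canary check is as a comparison of the canary's SLIs against a baseline taken from the stable fleet. The sketch below is a minimal example under assumed thresholds; `SliSnapshot`, `canary_verdict`, and every default value are hypothetical and would need tuning for a real service.

```python
from dataclasses import dataclass


@dataclass
class SliSnapshot:
    """SLIs collected over the same time window for one deployment group."""
    requests: int
    errors: int
    p99_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    baseline: SliSnapshot,
    canary: SliSnapshot,
    max_error_rate_delta: float = 0.001,  # assumed tolerance, absolute
    max_latency_ratio: float = 1.10,      # assumed tolerance, relative
    min_requests: int = 1000,             # assumed minimum sample size
) -> str:
    """Return "wait", "rollback", or "promote" for the canary."""
    # Too little traffic makes the comparison noisy, so hold the rollout.
    if canary.requests < min_requests:
        return "wait"
    if canary.error_rate - baseline.error_rate > max_error_rate_delta:
        return "rollback"
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    baseline = SliSnapshot(requests=100_000, errors=50, p99_latency_ms=200.0)
    canary = SliSnapshot(requests=5_000, errors=3, p99_latency_ms=205.0)
    print(canary_verdict(baseline, canary))
```

Comparing against a live baseline, rather than a fixed threshold, helps keep fleet-wide noise such as a traffic spike from being blamed on the new release.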

### ✅ 4. Race-Proven Resilience Engineering

Techniques like Disaster Recovery Testing (DiRT) and automated chaos drills build real production readiness

### ✅ 5. Production-Grade Observability

Ensuring robust monitoring pipelines, alert governance, and testing alerts proactively rather than reactively

***

## 🛠️ Hands-On Lab Concept

Participants form groups to sketch a simplified SRE pipeline:

1. CI → 2. Canary deployment → 3. Monitoring (SLIs) → 4. Incident & rollback playbook

Groups present basic flowcharts + discuss monitoring thresholds and incident triggers
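
For groups that prefer code to flowcharts, the four stages above could be sketched as a single function that records each step it takes. This is a toy model with assumed names and an assumed 1% error-rate threshold, not a reference design from the session.

```python
from typing import List


def run_mini_sre_flow(
    ci_passed: bool,
    canary_error_rates: List[float],
    error_rate_threshold: float = 0.01,  # assumed incident trigger
) -> List[str]:
    """Walk CI -> canary -> monitoring -> playbook and return the step log."""
    log: List[str] = []

    # Stage 1: CI gates the release.
    if not ci_passed:
        log.append("ci: failed, release blocked")
        return log
    log.append("ci: passed")

    # Stage 2: deploy to a small canary group.
    log.append("canary: deployed")

    # Stage 3: watch the canary's error-rate SLI sample by sample.
    for sample, rate in enumerate(canary_error_rates, start=1):
        if rate > error_rate_threshold:
            log.append(f"monitor: error rate {rate:.3f} breached threshold at sample {sample}")
            # Stage 4: the playbook defines the response to a breach.
            log.append("playbook: page on-call and roll back canary")
            return log

    log.append("monitor: all samples within threshold")
    log.append("rollout: promote to full fleet")
    return log


if __name__ == "__main__":
    for step in run_mini_sre_flow(True, [0.001, 0.002, 0.05]):
        print(step)
```

Changing the threshold or the sample list is a quick way to discuss which monitoring thresholds should trigger an incident.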

***

## 🎯 Why This Will Thrill Your Audience

Delivers battle-tested strategies from Google’s SRE practices
Balances theory and practice with interactive labs
Equips attendees to apply real resiliency tools and design patterns
Ideal for practitioners aiming at scale, reliability, and velocity

Join Zoom Meeting

[https://us02web.zoom.us/j/82496056794?pwd=AnGS7lBOP0HXSrkCjbXrblgrPKUiGU.1](https://us02web.zoom.us/j/82496056794?pwd=AnGS7lBOP0HXSrkCjbXrblgrPKUiGU.1)

Meeting ID: 824 9605 6794
Passcode: 002921
