Google NY Site Reliability Engineering (SRE) Tech Talks, 17 Sep 2024


Details
Google SRE NYC proudly announces our third Tech Talk event in 2024.
We invite you to join us for an hour of short talks on Site Reliability and DevOps topics with an opportunity to meet and talk with fellow engineers over light snacks and beverages.
This in-person-only event will take place at our Chelsea Market office in NYC. The doors will open at 5:30 pm.
When RSVP'ing to this event, please enter your full first and last name. It must match the government-issued ID that you will be required to present at check-in.
Agenda:
Managing Complex Migrations - Tom Elliott - Founder of a stealth startup in CI/CD space; formerly Director of Software Engineering at Yext
Migrations are often motivated by reliability, but can also harm reliability if not done with care.
Tom will explore several migrations he led over the past few years, including moving 2000+ jobs to Nomad, subsequently containerizing 2600, and moving to an HA setup with RabbitMQ.
We will discuss what went well and what went wrong in each instance, and how we applied what we learned to improve our migration competency.
Tom is a software engineer with 18 years of experience and focused expertise in DevOps. Over the past 5 years leading the SRE group at Yext, Tom oversaw the development of essential tooling and processes, including CI/CD pipelines, testing frameworks, and incident management, while guiding multiple migrations of thousands of microservices. Now, Tom is building Ocuroot, an innovative pipeline-free, YAML-free CI/CD tool designed for the Enterprise. Based in Manhattan, Tom enjoys spending spare moments procrastinating on his family book club reads.
Data SRE - an introduction - Venkat Mahalingham - Data SRE at Google (Google Maps)
Data safety is becoming increasingly important. This talk introduces the topic to the audience and broadens the discussion beyond traditional data integrity losses.
When you think of SRE, RPC services and service operations immediately come to mind: errors, latency, managing the size and number of tasks, and so on.
For most products, there is another important story: that of data flows and data sets. A critical error in data (e.g. a critical highway missing a segment in its route) could have widespread consequences for users.
No amount of RPC service level reliability will protect against that risk. We need to think about safety against data loss.
Venkat has been a Data SRE on Google Maps for the past 6 years. Venkat's focus is on delivering end-to-end data safety solutions, with an unwavering aim of making life better for users. He has experience with managing a wide range of data safety risks, including infrastructure, policy, and tooling. Prior to Google, he worked at Etsy and Microsoft, and helped build humanoid robots at Georgia Tech. He once tried but failed to get a selfie with Guido van Rossum.
Production Readiness 2.0: Continuous Readiness - Justin Reock - Head of Developer Relations at cortex.io
Justin will explain through real-world use cases how teams can adopt the emerging practice of metric scorecards to reduce meetings and streamline release readiness assessments using data and automation.
The list of criteria required to release a service to production, often referred to as a “production readiness standard,” is a mandatory component of reliable systems of software delivery. Aligning to these standards cross-functionally is challenging, especially when standards may need to be bypassed or changed, often at the last minute. And most importantly, systems always drift, and software that met these requirements six months ago may not still be meeting them today – so can they still be considered ready for production?
Teams often resort to time-consuming practices which are brittle and difficult to change. Cortex has pioneered the scorecard as a means of driving engineering initiatives using gamification. By ingesting data from the various systems that engineers would normally check manually, processes are streamlined and readiness checks are transformed into an always-on, continuous verification of readiness.
Justin Reock is the Head of Developer Relations for Cortex.io, and is an outspoken speaker, writer, and software practice evangelist. He has over 20 years of experience working in various software roles. He is a thought leader, delivering enterprise solutions, technical leadership, various publications and, most recently, community education on developer productivity.
Disclaimer: no recruiters, sales or press are allowed. Google is committed to providing a harassment-free and inclusive conference experience for everyone, and all participants must follow our Event Community Guidelines. The event will be photographed and video recorded.
