Google NY Site Reliability Engineering (SRE) Tech Talks, 16 Dec 2025

Name: Google NY Site Reliability Engineering (SRE) Tech Talks, 16 Dec 2025
Start: 2025-12-16T18:00:00-05:00
End: 2025-12-16T20:30:00-05:00
Location: Chelsea Market

Hosted by Vlad L.

New York Site Reliability Engineering Tech Talks

Details

Google SRE NYC proudly announces our last Google SRE NYC Tech Talk for 2025.

This event is co-sponsored by sentry.io. Thank you Sentry for your partnership!

Let's bid farewell to 2025 with three amazing interactive short talks on Site Reliability and DevOps topics! As always, the event will include an opportunity to mingle with the speakers and attendees over some light snacks and beverages after the talks.

The Meetup will take place on Tuesday, 16th of December 2025 at 6:00 PM at our Chelsea Market office in NYC. The doors will open at 5:30 PM. Please RSVP only if you're able to attend in person; there will be no live streaming.

When RSVP'ing to this event, please enter your full name exactly as it appears on your government-issued ID. You will be required to present your ID at check-in.

Agenda:
Paul Jaffre - Senior Developer Experience Engineer, sentry.io
One Trace to Rule Them All: Unifying Sentry Errors with OpenTelemetry tracing
SREs face the challenge of operating reliable observability infrastructure while avoiding vendor lock-in from proprietary APM (Application Performance Monitoring) solutions. OpenTelemetry has become the standard for instrumenting applications, allowing teams to collect traces, metrics, and logs. But raw telemetry data isn't enough. SREs need tools to visualize, debug, and respond to production incidents quickly. Sentry now supports OTLP, enabling teams to send OpenTelemetry data directly to Sentry for analysis. This talk covers how Sentry's OTLP support works in practice: connecting frontend and backend traces across services, correlating logs with distributed traces, and using tools to identify slow queries and performance bottlenecks. We'll discuss the practical benefits for SREs, like faster incident resolution, better cross-team debugging, and the flexibility to change observability backends without re-instrumenting code.
Paul’s background spans engineering, product management, UX design, and open source. He has a soft spot for dev tools and loses sleep over making things easy to understand and use.
Paul has a dynamic professional background, from strategy to stability. His time at Krossover Intelligence established a strong foundation by blending Product Management with hands-on development, and he later focused on core reliability at MakerBot, where he implemented automated end-to-end testing and drove performance improvements. He then extended this expertise in stability and scale at Cypress.io, where he served as a Developer Experience Engineer, focusing on improving workflow, contribution, and usability for their widely adopted open-source community.

Thiara Ortiz - Cloud Gaming SRE Manager, Netflix
Managing Black Box Systems
SREs often face ambiguity when managing black box systems (LLMs, games, poorly understood dependencies). We will discuss how Netflix monitors service health as black boxes using multiple measurement techniques to understand system behavior, aligning with the need for robust observability tools. These strategies are crucial for system reliability and user experience. By proactively identifying and resolving issues, we ensure a smoother playback experience and maintain user trust, even as the platform continues to evolve and mature. The principles shared in this talk can be extended to other applications, such as AI reliability in data quality and model deployments.

Thiara has worked at two of the largest internet companies in the world, Meta and Netflix. During her time at Meta, Thiara found a passion for distributed systems and bringing new hardware into production. Always curious to explore new solutions to complex problems, Thiara developed Fleet Scanner, internally known as Lemonaid, to perform memory, compute, and storage benchmarks on each Meta server in production. This service runs on over 5 million servers and continues to be used at Meta. Since leaving Meta, Thiara has worked at Netflix, first as a Senior CDN Reliability Engineer and now as Cloud Gaming SRE Manager. When incidents occur and Netflix's systems do not behave as expected, Thiara can be found engaging the necessary teams to remediate these issues.

Andrew Espira - Platform and Site Reliability Engineer, Founding Engineer, Kustode
ML-Powered Predictive SRE: Using Behavioral Signals to Prevent Cluster Inefficiencies Before They Impact Production
SREs managing ML clusters often discover resource inefficiencies and queue bottlenecks only after they've impacted production services. This talk presents a machine learning approach to predict these issues before they occur, transforming SRE from reactive firefighting to proactive system optimization.
We demonstrate how to build predictive models using production cluster traces that identify two critical failure modes: (1) GPU under-utilization relative to requested resources, and (2) abnormal queue wait times that indicate impending service degradation.
SRE practitioners will learn how to extract early warning indicators from standard cluster logs, build ML models that provide actionable confidence scores for operational decisions, and take practical steps to integrate predictive analytics into existing SRE toolchains to achieve a 50%+ reduction in resource waste and queue-related incidents.
This talk bridges the gap between traditional SRE observability and modern predictive analytics, showing how teams can evolve from reactive monitoring to intelligent, forward-looking reliability engineering.
Andrew has over 8 years of experience architecting and maintaining large-scale distributed systems. He is the Founding Engineer of Kustode (kustode.com), where he develops cutting-edge reliability and observability solutions for modern infrastructure in the insurance and health care space. Currently pursuing graduate studies in Data Science at Saint Peter's University, he specializes in the intersection of reliability engineering and artificial intelligence. His research focuses on applying machine learning to operational challenges, with publications in peer-reviewed venues including ScienceDirect. He's passionate about making complex systems more predictable and maintainable through data-driven approaches.
When not optimizing cluster performance or building the next generation of observability tools, Andrew enjoys contributing to open-source projects and mentoring early-career engineers in the SRE community.

Our Tech Talks series is for professional development and networking: no recruiters, sales, or press, please! Google is committed to providing a harassment-free and inclusive conference experience for everyone, and all participants must follow our Event Community Guidelines. The event will be photographed and video recorded.

Event space is limited! A reservation is required to attend. Reserve your spot today and share the event details with your SRE/DevOps friends 🙂
