
#55 Large Scale Production Engineering (LSPE) meet up

Hosted By
Debansu and Bathrinath R.

Details

Event Agenda and Speaker info

10:45: Registration
11:00: Tea

11:15: Opening keynote - Karthik Appigatla & Bathrinath Raveendran, Site Reliability Manager, Google

11:30: Maintaining reliability for an AI inference platform, Gaurav Sharma & Sainath Reddy, Nvidia

As AI-driven applications shift from experimentation to production, Site Reliability Engineers (SREs) are increasingly tasked with maintaining the health of AI inference platforms. But what exactly is AI inferencing, and how does it differ from conventional workloads in terms of reliability, observability, and scalability?

This talk demystifies AI inferencing for SREs and provides a hands-on view of what it takes to run inference workloads reliably on Kubernetes. We’ll start with a quick primer on AI inferencing—what it is, how it works, and the unique characteristics that impact infrastructure. Next, we'll walk through how to configure a Kubernetes cluster for serving models, including GPU scheduling, model versioning, resource isolation, and integration with inference-serving frameworks like KServe and NVIDIA Triton.
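As a flavor of the kind of configuration the talk covers, a minimal KServe InferenceService manifest that serves a model via NVIDIA Triton and requests a GPU might look like the following sketch (the resource names, model URI, and model format here are illustrative assumptions, not taken from the talk):

```yaml
# Illustrative sketch only: a minimal KServe InferenceService that serves
# a model with the Triton runtime and requests one GPU from the scheduler.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-model                # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: triton
      storageUri: gs://example-bucket/models/example-model  # hypothetical URI
      resources:
        limits:
          nvidia.com/gpu: "1"        # ask Kubernetes to place this on a GPU node
```

The `nvidia.com/gpu` resource limit is what ties the workload into GPU scheduling; model versioning and isolation are layered on top of this basic shape.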

12:00: Database at scale - Cloud Spanner, Jay Shamnani & Sirish Bandi, Google

Spanner is Google's distributed database, designed to offer scalability, high availability, and external consistency. Spanner achieves high availability by replicating data across zones within a region. With multi-region configurations, a Spanner instance can span multiple regions, providing excellent scalability. The integration of the TrueTime API, Paxos, and two-phase commit guarantees external consistency.

Geo-partitioning allows further segmentation of data to improve regional latencies in global databases. Alongside geo-partitioning, Spanner can improve latency through leader-aware routing or by calling regional endpoints directly. Fractional instances can further reduce the cost of operating the database.

1:00: Lunch break

2:00: Scaling logging infrastructure at CRED - Nirmaljeet Singh, CRED

CRED recently migrated its logging platform to a home-grown solution that handles 15 TB of daily log ingestion with a P99 query latency of 4 seconds and an ingestion latency of under 1 second. The transition resulted in annual cost savings of $600K. This talk walks through that journey.

2:45: Scaling Agentic AI Applications: The A2A Protocol - A TCP/IP Moment for AI, Spreeha Dutta, Google

Just as TCP/IP provided the fundamental standard for different computer networks to communicate and unlocked the potential of the internet, the AI world is facing a similar challenge. Today's AI agents often exist in isolation, built on specific platforms, making seamless interaction and the coordination of complex tasks across agents reliant on costly custom integrations.

What if AI agents, regardless of who built them or which platform they run on, could discover, communicate and collaborate with each other as easily as devices connect online? That’s where the Agent-to-Agent (A2A) Protocol comes in. In this talk, we'll introduce A2A, a groundbreaking standard designed to be the universal language for AI agents.
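As a rough illustration of the protocol's shape: A2A exchanges are JSON-RPC payloads carrying structured messages between agents. The sketch below builds one such request in Python; the method name and field layout are assumptions based on early public A2A material, not the authoritative schema.

```python
import json

# Sketch of an A2A-style JSON-RPC task request. The method name
# ("tasks/send") and the message/parts layout are illustrative
# assumptions; consult the A2A specification for the exact schema.
def make_task_request(task_id: str, text: str) -> str:
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "tasks/send",
        "params": {
            "id": task_id,
            "message": {
                "role": "user",
                "parts": [{"type": "text", "text": text}],
            },
        },
    }
    return json.dumps(payload)

request = make_task_request("task-123", "Summarize today's agenda")
```

Because the envelope is plain JSON-RPC, any agent that can speak HTTP and parse JSON can participate, which is the "TCP/IP moment" analogy the talk draws.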

3:15: Scaling an internal developer platform to 75,000 engineers, Turja Chaudhuri, Big4 Consulting

This talk covers our journey and learnings building a platform deployed across 60 global locations. The platform supports 3,200 business applications and is now the second-largest consumer of Azure in the world.

3:45 onwards: Tea, snacks, and networking.

Large Scale Production Engineering - India Forum
Respond by
Thursday, June 12, 2025
5:00 AM

Every 3 months on the 2nd Saturday

Google Ananta
Rio Business Park, Mahadevapura Village, Krishnarajapuram Hobli, Bengaluru East, Bengaluru, Karnataka 560037
FREE
200 spots left