6:00 - 6:30pm Refreshment reception
6:30 - 7:30pm Data Platform Observability @ Intuit
7:30 - 8:00pm How to Build Reliable Modern Data Pipelines Using AI and DataOps [Unravel Data]
Data Platform Observability @ Intuit [6:30 - 7:30pm]
Can you fly a plane without an instrumentation panel for speed, altitude, and so on? With the growing plethora of data technologies, there is a need for an equivalent instrumentation panel for data platforms to track quality, timeliness, change management, cost, and more. Observability focuses on providing highly granular insights into the behavior of systems, along with rich context. In this session, we cover four lightning talks (15 minutes each) on different aspects of observability.
1. Change alerting for Aurora MySQL with lineage integration
Source database schema changes can have a significant impact on downstream data ingestion, reporting, and analytics. This talk covers automated change alerting: capturing database schema changes from AWS Aurora MySQL, analyzing the source table lineage for impact, and proactively alerting the respective data teams before changes take effect.
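The core of such a pipeline can be sketched as a diff over schema snapshots joined with a lineage map. This is a minimal illustration, not Intuit's actual implementation: the snapshot format (column name to type, as might be read from information_schema.columns) and the lineage mapping of tables to consuming teams are assumptions.

```python
# Sketch: detect schema changes by diffing two snapshots of a table's
# column definitions, then fan alerts out to downstream teams via a
# (hypothetical) lineage map of table -> consuming teams.

def diff_schema(old, new):
    """Return human-readable descriptions of column-level changes."""
    changes = []
    for col in sorted(set(old) - set(new)):
        changes.append(f"column dropped: {col}")
    for col in sorted(set(new) - set(old)):
        changes.append(f"column added: {col} ({new[col]})")
    for col in sorted(set(old) & set(new)):
        if old[col] != new[col]:
            changes.append(f"type changed: {col} {old[col]} -> {new[col]}")
    return changes

def alert_downstream(table, changes, lineage):
    """Pair each impacted downstream team with the change list."""
    return [(team, changes) for team in lineage.get(table, [])]

old = {"id": "bigint", "email": "varchar(255)", "created": "datetime"}
new = {"id": "bigint", "email": "text", "signup_channel": "varchar(32)"}
lineage = {"users": ["ingestion-team", "reporting-team"]}

changes = diff_schema(old, new)
alerts = alert_downstream("users", changes, lineage)
```

In a real deployment the "new" snapshot would come from a staged DDL change, so the alert fires before the change reaches production.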
2. Multi-cluster EMR Observability
In contrast to single-cluster on-premise Hadoop deployments, the best practice for AWS EMR is to deploy multiple clusters serving different use cases. How do you operationalize and monitor tens of clusters? This talk covers end-to-end visibility and alerting across EMR clusters.
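One building block for fleet-wide visibility is rolling per-cluster metric snapshots up into a single health view. The sketch below is an assumption-laden illustration: the metric names (yarn_pending, hdfs_used_pct) and thresholds are made up, and real numbers would come from CloudWatch or the YARN APIs rather than inline dicts.

```python
# Sketch: flag EMR clusters whose latest metric snapshot breaches a
# threshold, across an arbitrary number of clusters.

THRESHOLDS = {"yarn_pending": 100, "hdfs_used_pct": 85.0}

def unhealthy_clusters(snapshots):
    """Return {cluster_id: [breached metric names]} for clusters over threshold."""
    breaches = {}
    for snap in snapshots:
        over = [m for m, limit in THRESHOLDS.items() if snap.get(m, 0) > limit]
        if over:
            breaches[snap["cluster_id"]] = over
    return breaches

snapshots = [
    {"cluster_id": "j-ADHOC", "yarn_pending": 12, "hdfs_used_pct": 40.0},
    {"cluster_id": "j-ETL", "yarn_pending": 450, "hdfs_used_pct": 91.5},
]
breaches = unhealthy_clusters(snapshots)
```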
3. CostBuddy: Where are the dollars going?
As organizations move to the cloud, budgeting, tracking, and optimizing dollar spend in the cloud becomes a critical capability, yet cost observability often gets little attention. This talk describes our challenges with cost accountability and budgeting as we transitioned to operating our data platform 100% in the cloud, and CostBuddy, the tool we developed for cost observability.
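The essential check behind budget tracking is comparing month-to-date spend against a prorated budget per account. The sketch below shows that idea only; the account names, amounts, and proration scheme are invented for illustration and say nothing about how CostBuddy itself works.

```python
# Sketch: flag accounts whose month-to-date spend exceeds the budget
# prorated by how much of the month has elapsed.

def budget_status(budgets, actuals, pct_of_month_elapsed):
    """Return {account: overrun amount} for accounts over prorated budget."""
    flagged = {}
    for account, budget in budgets.items():
        allowed = budget * pct_of_month_elapsed
        spend = actuals.get(account, 0.0)
        if spend > allowed:
            flagged[account] = round(spend - allowed, 2)
    return flagged

budgets = {"data-platform": 50_000.0, "analytics": 20_000.0}
actuals = {"data-platform": 30_000.0, "analytics": 9_000.0}
# Halfway through the month, data-platform is past its prorated 25k.
overruns = budget_status(budgets, actuals, 0.5)
```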
4. Raw Data & Metric Observability
In a data-driven world, the quality of data and insights is extremely important: decisions based on bad data can significantly impact business and customer confidence. This talk covers how we use ML algorithms to detect anomalies in data quality and business metrics. Proactive alerting prevents bad data and insights from being consumed for decision making.
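As a toy stand-in for the ML approaches the talk describes, a rolling z-score already captures the shape of metric anomaly detection: score each new point against recent history and alert when it deviates too far. The metric series and threshold below are illustrative assumptions.

```python
# Sketch: flag points in a daily metric series that deviate more than
# `threshold` standard deviations from the mean of the preceding window.
import statistics

def zscore_anomalies(series, window=7, threshold=3.0):
    """Return indices of anomalous points, judged against a rolling window."""
    flagged = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.mean(hist)
        sigma = statistics.stdev(hist)
        if sigma and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

daily_orders = [100, 102, 98, 101, 99, 103, 100, 97, 12, 101]
anomalies = zscore_anomalies(daily_orders)  # the collapse to 12 is flagged
```

Wiring such a detector in before publication, rather than after, is what turns detection into the proactive gating the abstract describes.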
How to Build Reliable Modern Data Pipelines Using AI and DataOps [7:30 - 8:00pm]
Organizations are building strategic data pipelines that generate insights from a mountain of internal and external data sources. For example, a Customer 360 data pipeline may combine customer data from multiple business channels, including stores, online, social media, and third-party demographic data.
Modern data pipelines have multiple components, such as Kafka, Spark, NoSQL stores, Redshift, and Kubernetes, each of which generates its own log files containing thousands of non-correlated events. These pipelines are inherently complex and unfortunately fail for many reasons. Hunting for the root cause of a pipeline failure in messy, raw, and distributed logs is hard for performance experts, and a nightmare for data operations teams tasked with managing application SLAs. It gets harder as the complexity, scale, and speed of modern data pipelines increase. In this talk, Prof. Shivnath Babu explains how to apply performance monitoring techniques and artificial intelligence to your data pipelines and supporting big data systems to keep your applications running reliably, whether in your own data center or in the public cloud. Topics covered include application auto-tuning, root-cause analysis of distributed application failures, SLA management for streaming data pipelines, and holistic cluster optimization.