Big data meetup (Hybrid) hosted by LinkedIn


Details
Date: Nov 4th, 2022
Time: 3pm - 6:30pm IST
Venue: This is a hybrid event. The in-person portion is hosted by LinkedIn at its Bangalore office. For those who prefer to join online, the URL is https://linkedin.zoom.us/s/96524781425
Contact: Aman Goel (LinkedIn) at +91 81059 49670
Agenda:
3:00 - Event opens
3:15 - Cyclical trends in big data (Arvind Jayaprakash, InMobi)
3:55 - Opal: Mutable Database Over Immutable Data Lake (Aditya Gupta / Vamsi Korada, LinkedIn)
4:35 - Break
4:45 - Multi-stage query engine - a new shape for Apache Druid (Laksh Singla / Karan Kumar, Imply)
5:25 - Bloom filter optimization in Apache Spark (Mahesh & Kapil, Microsoft)
6:05 onwards - Networking
1. Cyclical trends in big data
Big data tech trends have exhibited a pendulum effect across various dimensions: flat files vs. well-catalogued tables, colocation vs. disaggregation of compute and storage, transactional guarantees vs. eventual consistency, and so on. Embracing the oscillatory nature of solutions in this space is essential to avoid constantly falling behind new trends. This talk goes over what drives these cyclical shifts, so that one can understand not just why we are where we are, but also build an intuition for where things could head.
Speakers
Arvind Jayaprakash is currently the SVP of Technology at Glance. He has been with the InMobi Group for the past 14 years, where he has served as principal software architect and also run the platform engineering teams. His expertise lies in building internet businesses that process billions of transactions each day. Prior to joining InMobi, he worked at Yahoo! on various high-scale consumer properties.
2. Opal: Mutable Database Over Immutable Data Lake
It is challenging to reflect online table updates, inserts, and deletes in a data lake, because the latter tends to be immutable. To balance data quality, latency, scan performance, and system scalability (in terms of both compute resources and operational cost), we developed Opal. We use Opal to ingest mutable data, such as records from databases (Oracle, MySQL, Espresso, Venice, etc.). It builds a mutable dataset on top of the immutable file system without eagerly reconciling the ingested data files.
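The sketch below illustrates the general merge-on-read idea behind such systems: base files stay immutable, while updates and deletes land in an append-only delta log and are reconciled lazily at read time. This is a minimal, hypothetical illustration in plain Python, not Opal's actual design; all names and structures here are assumptions.

# Minimal merge-on-read sketch: base data is never rewritten; mutations are
# appended as deltas and replayed over the base snapshot only when reading.
# This is NOT Opal's design -- a conceptual illustration only.

from dataclasses import dataclass, field


@dataclass
class MutableView:
    base: dict = field(default_factory=dict)    # rows from immutable base files, keyed by primary key
    deltas: list = field(default_factory=list)  # append-only log of (op, key, row) entries

    def upsert(self, key, row):
        # An update or insert is just an appended delta; no base file is touched.
        self.deltas.append(("upsert", key, row))

    def delete(self, key):
        self.deltas.append(("delete", key, None))

    def read(self):
        # Reconcile lazily at scan time: replay deltas over the base snapshot.
        snapshot = dict(self.base)
        for op, key, row in self.deltas:
            if op == "upsert":
                snapshot[key] = row
            else:
                snapshot.pop(key, None)
        return snapshot


view = MutableView(base={1: {"name": "alice"}, 2: {"name": "bob"}})
view.upsert(2, {"name": "bob v2"})   # update arriving from an online database
view.delete(1)                       # delete propagated from the source table
print(view.read())                   # {2: {'name': 'bob v2'}}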
Speakers
Aditya is a senior software engineer at LinkedIn on the DataLake Storage team, focusing primarily on database ingestion and mutable data storage in the data lake. Previously, he worked on stabilizing and optimizing an exabyte-scale anonymized data warehouse and on enabling compliance for an exabyte-scale data lake.
Vamsi is a senior software engineer at LinkedIn on the Big Data Platform team, focusing primarily on database ingestion and mutable data storage in the data lake. Prior to Opal, he worked on data compliance and anonymized data warehousing solutions on the offline data lake at LinkedIn.
3. Multi-stage query engine - a new shape for Apache Druid
Druid already has a query engine that works exceptionally well for interactive workloads. With the multi-stage query engine (MSQ), Druid users can better handle use cases that require long-running, heavyweight queries. One such use case is batch ingesting data into Druid itself. In this talk, we will cover the architecture and design of the multi-stage query engine and how we are using MSQ for SQL-based batch ingestion. We will cover how the new engine does shuffling, partitioning, and segment sizing. We will also go over how we extended Calcite to include the custom grammar support required for SQL-based ingestion. In the end, we will wrap everything up with a live demo.
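As a rough sketch of what SQL-based batch ingestion looks like from the client side, the snippet below submits an INSERT ... SELECT ... PARTITIONED BY statement to a Druid MSQ task endpoint. The endpoint path, the EXTERN parameters, and the datasource name are assumptions based on publicly documented Druid MSQ usage, not details from the talk; verify them against your Druid version.

# Submitting an SQL-based ingestion statement to the multi-stage query engine.
# Endpoint path and EXTERN arguments are assumptions; check your Druid docs.

import requests  # third-party; pip install requests

DRUID_ROUTER = "http://localhost:8888"  # hypothetical local Druid router

# INSERT ... SELECT ... PARTITIONED BY is the custom grammar extension the
# talk refers to: ingestion is expressed as SQL and executed as a multi-stage
# query task that shuffles, partitions, and sizes segments.
ingest_sql = """
INSERT INTO wikipedia_edits
SELECT
  TIME_PARSE("timestamp") AS __time,
  page,
  "user"
FROM TABLE(
  EXTERN(
    '{"type":"http","uris":["https://example.com/wikipedia.json.gz"]}',
    '{"type":"json"}',
    '[{"name":"timestamp","type":"string"},{"name":"page","type":"string"},{"name":"user","type":"string"}]'
  )
)
PARTITIONED BY DAY
CLUSTERED BY page
"""

resp = requests.post(f"{DRUID_ROUTER}/druid/v2/sql/task", json={"query": ingest_sql})
resp.raise_for_status()
print(resp.json())  # returns a task id that can be polled for ingestion status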
Speakers
Laksh Singla is a software engineer at Imply, working on the Apache Druid project, primarily on the newly launched Multi Stage Query engine. He has been contributing to the project for about a year now.
Karan Kumar has been working on data teams for more than 10 years, contributing to various OSS projects such as Falcon, Pig, Kafka, and Presto. For the past year, he has been actively working on core Druid development at Imply, focusing on the newly launched multi-stage query engine.
4. Bloom filter optimization in Apache Spark
Over the last few years, Spark has emerged as one of the leading engines for processing big data. With the growing demand for real-time query analytics over petabytes of data, many improvements have been made in the Spark query execution layer. In this talk, we will look at how Bloom filters are used in Spark to reduce execution time, and at the enhancements made to the Bloom filter implementation in the Azure Synapse Spark engine.
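To make the idea concrete, here is a toy, self-contained Bloom filter used for semi-join pruning: build a compact filter over the join keys of the smaller side, then drop non-matching rows from the larger side before the expensive join and shuffle. This is an illustrative sketch only, not the Synapse Spark implementation; the class, parameters, and sample data are hypothetical.

# Toy Bloom filter for semi-join pruning. Illustrative only; not the
# Azure Synapse Spark implementation.

import hashlib


class BloomFilter:
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive k bit positions from salted hashes of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Build side: join keys from the small dimension table.
bf = BloomFilter()
for key in ["IN", "US", "DE"]:
    bf.add(key)

# Probe side: the large fact table is pruned before the actual join.
fact_rows = [("IN", 100), ("BR", 7), ("US", 42), ("JP", 3)]
pruned = [row for row in fact_rows if bf.might_contain(row[0])]
print(pruned)  # rows whose keys are not in the build side are (almost certainly) dropped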
Speakers
Mahesh is currently a Principal Software Engineer on the Microsoft Synapse Spark team, focusing on performance engineering and query optimization.
Kapil Singh is a Software Engineer 2 at Microsoft on the Azure Data Synapse team. He has been with Microsoft for around two years and works on the query optimization area of the Apache Spark engine. He has worked on Bloom filter joins and on performance analysis of the Spark runtime.