Breaking the Ice around Apache Iceberg to drive Next-Gen Analytics

Name: Breaking the Ice around Apache Iceberg to drive Next-Gen Analytics
Start: 2023-05-11T17:00:00-07:00
End: 2023-05-11T20:00:00-07:00
Location: Cloudera

Hosted By

Future of D.

Details

This will be a hybrid event, held in person and on Zoom.

In this session we will cover:

Hive & Impala integration with Iceberg - Vincent Kulandaisamy, Bill Zhang, Aman Sinha
Performance improvement in Spark Iceberg integration - Asif Shahid
Incremental View Maintenance with Iceberg, Coral, and DBT - Walaa Eldin Moustafa, Aastha Agrrawal (LinkedIn)

You can join the meeting virtually here:

Join Zoom Meeting: [https://cloudera.zoom.us/j/99956441254](https://cloudera.zoom.us/j/99956441254)

Session Details:

Hive & Impala integration with Iceberg: In this talk, Cloudera will present how it has integrated its data warehouse compute engines, Hive and Impala, with Iceberg for high-speed analytics at scale. The team will also describe the advanced features and performance improvements the integration provides, and showcase them with demos.
Performance improvement in Spark Iceberg integration: This talk will outline how we are extending Spark's Dynamic Partition Pruning (DPP) with Iceberg to improve the performance of broadcast hash join queries. DPP applies only when the join involves a partition column. The new work instead uses the already-broadcast keys of the hash join, which are available on the driver and executors, to do range-based pruning of manifest files (on the driver) and of data files and RowStore chunks (on executors). Because the pruning relies on min/max stats, it also works for non-partition columns. The work also enables Iceberg's partition-based runtime filters to be used on executors for further pruning when the partitioning strategy involves a transform on the column.
Incremental View Maintenance with Iceberg, Coral, and DBT: In this talk, the Data Infrastructure @ LinkedIn team will present how the integration of DBT, Coral, and Iceberg can provide a novel data management experience for defining SQL workflows. In this UX, users define their workflows as a cascade of SQL queries, which then get auto-materialised and incrementally maintained. Applications of this user experience include Declarative DAG workflows, streaming/batch convergence, and materialised views. The team will demo the integration using DBT and some example queries.
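
To give a feel for the range-based pruning idea in the Spark talk, here is a minimal sketch in plain Python. It does not use Spark or Iceberg APIs and is not the speakers' implementation. The `DataFile` class, the `prune_by_join_keys` function, and the file names are all hypothetical, and the sketch assumes integer join keys and a single column with min/max stats per file.

```python
import bisect
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for a file entry carrying min/max stats for one column."""
    path: str
    min_value: int
    max_value: int


def prune_by_join_keys(files: Iterable[DataFile], join_keys: Iterable[int]) -> List[DataFile]:
    """Keep only files whose [min, max] range could contain at least one join key."""
    keys = sorted(set(join_keys))
    if not keys:
        # With no build-side keys, an inner join cannot match any row.
        return []
    kept = []
    for f in files:
        # Find the smallest key that is >= the file's minimum value.
        i = bisect.bisect_left(keys, f.min_value)
        # The file may hold a match only if that key is also <= the file's maximum.
        if i < len(keys) and keys[i] <= f.max_value:
            kept.append(f)
    return kept


# Illustrative data: three files with disjoint value ranges on a non-partition column.
files = [
    DataFile("a.parquet", 0, 99),
    DataFile("b.parquet", 100, 199),
    DataFile("c.parquet", 200, 299),
]
broadcast_keys = [5, 42, 250]  # keys from the small (broadcast) side of the join

surviving = prune_by_join_keys(files, broadcast_keys)
print([f.path for f in surviving])
```

In this toy case the file covering 100 to 199 is skipped because no broadcast key falls in its range. The same check could in principle be applied at each level the talk mentions (manifests, data files, chunks), as long as min/max stats are available at that level.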

Speakers:

Vincent Kulandaisamy: Vince is a Senior Principal Software Engineer at Cloudera. He has 24+ years of experience in database kernel internals architecture, design, and development, building OLTP and DW database products. He is responsible for DW and Open Data Lake architectures and innovations, and for the integration of Apache Iceberg with Cloudera Data Platform across various form factors.
Bill Zhang: Bill is Senior Director of Data Warehouse Product Management at Cloudera. He is responsible for Apache Hive and Apache Iceberg integration with Cloudera Data Platform.
Asif Shahid: Asif is a Principal Engineer on the Iceberg team at Cloudera. He has 24+ years of experience in distributed caching and in SQL and object querying engine development. He has worked to enhance Spark’s Catalyst optimizer for complex plans (Constraint Propagation Rule rewrite, PushDownPredicate rewrite, avoiding re-runs of optimizer rules by identifying immutable sections). At Cloudera, he works on the Iceberg and Spark layers.
Aman Sinha: Aman is a Director of Engineering at Cloudera. He leads a team responsible for SQL query optimization and shared services across Hive and Impala as part of Cloudera Data Warehouse. He has extensive experience developing query processing engines for big data systems and for relational and NoSQL databases. He is a committer and PMC member in prominent open source projects.
Walaa Eldin Moustafa: Walaa Eldin Moustafa is a Senior Staff Software Engineer at LinkedIn, where he builds big data infrastructure and solutions that enable unified and performant data processing across different table formats, compute engines, and language APIs. Walaa holds a PhD in Computer Science from the University of Maryland at College Park. He has co-authored a number of publications at database conferences including SIGMOD, ICDE, and IEEE Big Data, on topics that focus on modern applications of relational, deductive, and graph database management systems.
Aastha Agrrawal: Aastha Agrrawal is a Senior Software Engineer at LinkedIn and works on Big Data infrastructure. She is currently focusing on open source frameworks to streamline data management across various computing engines, while prioritizing developer productivity. Aastha holds a master's degree in Computational Science and Engineering from Georgia Institute of Technology.