Join us for an evening of tech talks about Apache Spark and Delta Lake at scale, presented by speakers from LinkedIn and Databricks. This meetup is hosted and sponsored by LinkedIn.

Agenda:
6:00 - 6:30 pm: Social Hour with Food & Drinks
6:30 - 6:35 pm: Introduction & Announcements
6:35 - 7:55 pm: Tech Talk-1 from LinkedIn
7:55 - 8:15 pm: Tech Talk-2 from LinkedIn
8:20 - 9:00 pm: Tech Talk-3 from Databricks

Talk 1
Title: Dali-Spark: Apache Spark Data Access at LinkedIn to Achieve Data Agility
Presenter: Adwait Tumbde
Abstract: At LinkedIn, changes to our complex data ecosystem can impose steep costs on application developers. We designed Dali to insulate developers from physical attributes of data such as storage format and location. At its core, Dali provides a catalog to define and evolve physical and virtual datasets, a dataset reader that allows applications to read datasets in different environments, and a collection of development tools to manage these datasets.
In this talk, we will cover:
* What Dali is and how it simplifies a complex data ecosystem
* Dali as a unified data access layer at LinkedIn
* DaliSpark architecture
* Roadmap, including plans to open source Dali
Bio: Adwait Tumbde is an engineering manager at LinkedIn and leads a team focused on simplifying data management for big data. He has also contributed to the development of Apache Pinot and Presto at LinkedIn. Before joining LinkedIn, he was one of the original developers of Sherpa, a large-scale key-value store at Yahoo!. His interests include large-scale distributed systems and databases.

Talk 2
Title: Optimizing Apache Spark SQL at LinkedIn
Presenter: Fangshi Li
Abstract: Improving Spark SQL usability and computing efficiency is one of the missions of LinkedIn's Spark team. In this talk, we will present the Spark SQL ecosystem and roadmap at LinkedIn and introduce the main projects we are working on, such as:
* Improving Dataset performance with automated column pruning
* Bringing an efficient 2d join algorithm to Spark SQL
* Fixing join skewness with adaptive execution
* Enhancing the cost-based optimizer with a history-based learning approach
Bio: Fangshi Li is a software engineer at LinkedIn. He has been working on Spark core infrastructure, user libraries, AI solutions, and Spark SQL engine optimizations. He was one of the original developers of Dr. Elephant, the performance tuning tool for Hadoop/Spark.

Talk 3
Title: Open Source Reliability for Data Lake with Apache Spark
Presenter: Michael Armbrust (https://databricks.com/speaker/michael-armbrust)
Abstract: Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Bio: Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and the Delta Lake open source project. He received his Ph.D. from UC Berkeley in 2013 and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.

NOTE: You may need a government-issued ID to enter the premises or the conference room.