Next Generation Open Source Data Infra (Iceberg, Spark and databases)
Details
Important Note: You must register for the event (free) on ti.to before the event. You will then be sent an eNDA, which needs to be signed 24 hours before the event for security reasons. A pre-printed badge will be waiting for you when you arrive at the event. Please register here (https://ti.to/big-data/data-infrastructure/with/dbud9l-da7a). If for some reason you are not able to sign the eNDA online, you can still attend, but you may have to wait in a long line at the sign-in desk.
Talk #1: Introducing Iceberg: Tables designed for object stores
This talk will focus on Iceberg, a new table metadata format that is designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes and fixing longstanding problems like reliable schema evolution. This talk will include an overview of how Iceberg works and details about how Netflix is using Iceberg to make big data easier and more reliable.
Speaker Bio:
Ryan Blue works on Netflix's big data platform team. He contributes to Apache Spark and is a PMC member of Apache Parquet and Apache Avro.
Talk #2: Scaling Apache Spark Usage at Lyft
In this talk, Li will describe how Apache Spark is currently used at Lyft and how Lyft scales that usage for machine learning and ETL-type workloads through a managed multi-cluster model. Li will also show how Lyft operates Apache Spark with autoscaling and high availability support, and how Spark coexists with Lyft's Apache Hive and other data infrastructure services as a portfolio offered to a wide range of customers.
Speaker Bio:
Li Gao is the tech lead for the Apache Spark domain in the Data Infrastructure org at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.
Talk #3: From flat files to deconstructed database: The evolution and future of the big data ecosystem
In this talk, Julien discusses the key open source components of the big data ecosystem—including Apache Calcite, Parquet, Arrow, Avro, and Kafka as well as batch and streaming systems—and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. (Parquet is the columnar data layout to optimize data at rest for querying. Arrow is the in-memory representation for maximum throughput execution and overhead-free data exchange. Calcite is the optimizer to make the most of our infrastructure capabilities.) Julien also explores the emerging components that are still missing or haven’t become standard yet to fully materialize the transformation to an extremely flexible database that lets you innovate with your data.
Speaker Bio:
Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Julien is a principal engineer at WeWork where he works on the data platform architecture. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

