Over the past two decades, the Big Data stack has evolved rapidly, with numerous innovations driven by the rise of many different open source projects and communities. In this meetup, speakers from Uber, Alibaba, and Alluxio will share best practices for addressing the challenges and opportunities in evolving data architectures using new and emerging open source building blocks. Topics include data format optimization (ORC), storage security (HDFS), data format layers (Parquet), and unified data access layers (Alluxio).
5:30PM - Doors Open
6:00PM - Introduction
6:15PM - Uber: Towards Finer-grained Access Control in Hadoop
6:35PM - Alluxio: Building a Distributed Data Access Layer for Big Data Analytics on Any Cloud
6:55PM - Alibaba: Columnar Storage of Massive Scale at Alibaba
7:15PM - Q&A and Networking
Please fill out our Google Survey after RSVPing.
More info on the talks:
Uber: Towards Finer-grained Access Control in Hadoop [Xinli Shang, Guang Yang]
Fine-grained access control at the column and partition level for data stored in Hadoop is an industry requirement. In this talk, we will present the basics of two finer-grained access control mechanisms in Hadoop, column-level and partition-level access control, along with their use cases. We will also describe the initial integration of this technology with the Apache Spark, Presto, and Hudi frameworks, and its scalability and performance implications for analytic workloads.
Alluxio: Building a Distributed Data Access Layer for Big Data Analytics on Any Cloud [Bin Fan]
The rise of computation-intensive workloads and the adoption of the cloud have driven organizations to adopt a decoupled architecture for modern workloads, one in which compute scales independently from storage.
Alluxio is an open source distributed file system that sits between the compute and storage layers and allows you to realize the benefits of a decoupled architecture with the same performance, better data and metadata locality, and lower costs. In this talk, we will discuss the design of Alluxio, its architecture, and its use cases. We will dive into the choices in its design space and share our experiences implementing data tiering, storage options, and cache eviction policies.
Alibaba: Columnar Storage of Massive Scale at Alibaba [Gang Wu]
Alibaba is one of the largest e-commerce companies in China, and it also specializes in Internet, retail, logistics, payment, entertainment, and cloud computing. Alibaba has accumulated exabytes of data, with petabytes more generated daily, and MaxCompute, our in-house large-scale data processing platform, accounts for more than 99% of all offline data across the company. The columnar storage engine of MaxCompute is built on Apache ORC, which has proven successful in Hadoop, Hive, Spark, Presto, and other systems. To support large-scale computing and storage in MaxCompute, various optimizations have been made to ORC. In this talk, we will discuss what we have achieved by leveraging ORC and shed light on the major optimizations that power our use cases.