Unify Data Analytics: Any Stack Any Cloud


Details
Alluxio (www.alluxio.io) is an open-source virtual distributed file system that provides a unified data access layer for hybrid and multi-cloud deployments. It lets distributed compute engines such as Spark and Presto, as well as machine learning frameworks such as TensorFlow, transparently access different persistent storage systems (including HDFS, S3, and Azure) while using an in-memory cache to accelerate data access. Originally developed at UC Berkeley's AMPLab as the research project "Tachyon", Alluxio has more than 1200 contributors and is used by over 100 companies worldwide, with the largest production deployment exceeding 1000 nodes.
This presentation focuses on how Alluxio helps the big data analytics stack become cloud-native. Cloud object storage systems, which are growing in popularity, offer more cost-effective and scalable storage than HDFS, but they also bring different semantics and performance implications. Applications like Spark or Presto do not benefit from node-level locality or cross-job caching when they retrieve data from cloud object storage. Deploying Alluxio in front of cloud storage addresses these problems, because data is retrieved and cached in Alluxio instead of being fetched repeatedly from the underlying cloud or object store.
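The cross-job caching idea described above can be illustrated with a toy read-through cache. This is only a minimal sketch of the general pattern, not Alluxio's API or implementation: the `ReadThroughCache` class, the `fake_object_store` dictionary, and the `s3://bucket/events.csv` path are all hypothetical names made up for the example, and it leaves out eviction, tiering, and distribution across nodes.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Toy read-through cache: repeated reads are served from memory."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch                  # hypothetical call into the slow remote store
        self._cache: Dict[str, bytes] = {}   # in-memory copy of objects already read
        self.remote_reads = 0                # number of times the remote store was hit

    def read(self, path: str) -> bytes:
        # On a miss, fetch from the remote store and remember the result.
        if path not in self._cache:
            self.remote_reads += 1
            self._cache[path] = self._fetch(path)
        return self._cache[path]


# Stand-in for a cloud object store (assumption: a plain dict, no network).
fake_object_store = {"s3://bucket/events.csv": b"id,value\n1,42\n"}

cache = ReadThroughCache(lambda path: fake_object_store[path])

# Two "jobs" read the same object; only the first read reaches the store.
first = cache.read("s3://bucket/events.csv")
second = cache.read("s3://bucket/events.csv")
print(cache.remote_reads)  # prints 1
```

The point of the sketch is only that the second reader pays no remote round trip, which is the benefit the abstract attributes to placing a caching layer between compute engines and object storage.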
In this talk, the speakers will present:
- New trends and challenges in the data ecosystem in the cloud era
- Basic concepts in Alluxio
- Production use cases of Alluxio
Speaker Bios:
- Bin Fan is a founding member and VP of Open Source at Alluxio. He is also a PMC maintainer and the PMC Chair of the Alluxio open source project. Prior to joining Alluxio as a founding engineer, he worked at Google building next-generation storage infrastructure. Bin received his Ph.D. in computer science from Carnegie Mellon University, where he worked on the design and implementation of distributed systems.
- Dr. Shouwei Chen is a software engineer at Alluxio. Before joining Alluxio, Shouwei received a Ph.D. from Rutgers University. Shouwei’s research focuses on the co-design of memory-centric computing frameworks with in-memory distributed file systems in large-scale environments.
