Near Real-time Big Data Platform w/ Spark & Alluxio - Vipshop eCommerce Use Case

Hosted By
Calvin J. and 2 others
6:30pm - Happy Hour and networking
7:00pm - Alluxio 2.0 Preview Release Deep Dive - Calvin Jia
7:30pm - Real-time Data Processing for Sales Attribution Analysis with Alluxio, Spark and Hive at VIPShop - Wanchun Wang - Chief Architect
7:45pm - Q&A

Event partner: AICamp

Talk 1:
Title: Alluxio 2.0 Preview Release Deep Dive

We are excited to present Alluxio 2.0 to our community. The goals of Alluxio 2.0 are to significantly enhance data accessibility with improved APIs, to expand the supported use cases to include active workloads, and to improve metadata management and availability for hyperscale deployments. The Alluxio 2.0 Preview Release is the first major milestone on the path to Alluxio 2.0 and includes many new features.

In this talk, I will give an overview of the motivations and design decisions behind the major changes in the Alluxio 2.0 release. We will touch on the key features:
- New off-heap metadata storage leveraging embedded RocksDB to scale Alluxio up to handle a billion files;
- Improved Alluxio POSIX API to support legacy and machine-learning workloads;
- A fully contained, distributed embedded journal system based on the Raft consensus algorithm for high availability mode;
- A lightweight distributed compute framework called “Alluxio Job Service” to support Alluxio operations such as active replication, async-persist, cross-mount move/copy, and distributed loading;
- Support for mounting and connecting to any number of HDFS clusters of different versions at the same time;
- Active file system sync between Alluxio and HDFS as under storage.

Calvin Jia is the top contributor to the Alluxio project. He has been involved as a core maintainer and release manager since the early days, when the project was known as Tachyon. Calvin has a B.S. from the University of California, Berkeley.

Talk 2:
Title: Real-time Data Processing for Sales Attribution Analysis with Alluxio, Spark and Hive at VIPShop

Vipshop is a leading eCommerce company in China with over 15 million daily active users. Our ETL jobs primarily run against data on HDFS, which can no longer meet the growing speed and stability demands of certain real-time jobs. In this talk, I will explain how we replaced HDFS with memory + HDD managed by Alluxio to speed up data access for all our Sales Attribution applications running on Spark and Hive. This system has been in production for more than 2 years. As more old-fashioned ETL SQL jobs are converted into real-time jobs, leveraging Alluxio for caching has become one of the widely considered performance tuning solutions. I will share the criteria we use to select use cases that can effectively get a boost by switching to Alluxio.

Our future work includes using Alluxio as an abstraction layer for the /tmp directory in our main Hadoop clusters, and we are also considering using Alluxio to cache hot data in our 600+ node Presto clusters.

Wanchun Wang is the Chief Architect at VIPShop, where he has worked for over 5 years. His interests focus on processing large amounts of data, including building streaming pipelines, optimizing ETL applications, and designing in-house ML & DL platforms. He currently manages the big data teams responsible for batch, real-time, and data warehouse systems.

Our event partner AICamp is a global online platform for engineers and data scientists to learn and practice AI, ML, DL, and data science, with 80,000+ developers and local study groups in 40+ cities around the world.
Alluxio Bay Area Meetup
1825 S Grant St · San Mateo, CA
How to find us

First floor training room in building 1825.
