Past Meetup

User Stories: Alluxio production use cases with Presto and Hive

25 people went

Details

Reactive New York is co-organizing a meetup with Alluxio, a memory-speed virtual distributed storage system.

Alluxio, formerly Tachyon, was originally developed by AMPLab at UC Berkeley and enables any application to interact with any data from any storage system at memory speed. It is an open source, memory-speed virtual distributed storage system that integrates well with Spark, Flink, and Hadoop. This meetup will feature talks from Alluxio, JD.com, and Bazaarvoice.

Title: Alluxio: An Overview and What's New in 1.8

Abstract:
Alluxio is a memory-speed virtual distributed file system that provides the big-data analytics stack with a unified data access layer. As this new layer, Alluxio enables compute frameworks such as Spark, Presto, MapReduce, and Hive to transparently access different persistent storage systems while actively leveraging memory to accelerate data access. As a result, Alluxio helps simplify the development and management of big data and machine learning workloads, with lower cost and better performance. Alluxio originated from “Tachyon”, a research project of AMPLab at UC Berkeley. Currently, the project has more than 800 contributors from more than 100 companies and organizations worldwide.
In this talk, Haoyuan and Bin will give an overview of Alluxio: its basic concepts, architecture, data flow, and how it interacts with other components of the ecosystem. They will also share production use cases. Then they will cover the new features in the latest 1.8 release and the roadmap for future versions.

Title: Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks

Abstract:
JD.com is China’s largest online retailer and its biggest overall retailer, as well as the country’s biggest internet company by revenue. Currently, JD.com’s BDP platform runs more than 400,000 jobs (15+ PB) daily on a system with more than 15,000 cluster nodes and a total capacity of 210 PB. Alluxio has run in JD.com’s production environment on 100 nodes for six months. Tao and Bing will explain how JD.com uses Alluxio to support ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component, achieving a 10x performance improvement on average with JDPresto. This work has also extended Alluxio and enhanced the syncing between Alluxio and HDFS for consistency.

Title: Hybrid collaborative tiered-storage with Alluxio

Abstract:
Systems that read from AWS S3 often pay a performance penalty: there is no co-location, and the data has to move over slower, often congested networks. Alluxio can provide a caching layer for the data; however, there is still the question of how and when to move which data. Should all data be cached by default, or should it be cached only when used? In this talk, I will explore the gray area in between, where users and dataset publishers collaborate to decide what data is cached, and how, in a tiered-storage architecture to maximize performance and minimize operating costs.