Memory is the key to fast big data processing. Many have recognized this, and frameworks such as Spark and Shark already exploit memory performance. With these advances, big data storage is becoming a critical bottleneck in many workloads.
In this talk, we introduce Tachyon, a memory-centric, fault-tolerant distributed file system that enables reliable file sharing at memory speed across cluster frameworks such as Spark and MapReduce. Tachyon achieves memory speed and fault tolerance by using memory aggressively and leveraging lineage information. It caches working-set files in memory and lets different jobs, queries, and frameworks access the cached files at memory speed. As a result, frequently read datasets do not have to be loaded from disk.
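The idea of combining an in-memory cache with lineage can be illustrated with a small toy sketch. This is not Tachyon's API or implementation: the class `LineageStore` and its methods are hypothetical names chosen for illustration, and the sketch assumes that each derived file can be rebuilt by re-running a recorded function over its input files.

```python
from typing import Callable, Dict, Iterable, Tuple


class LineageStore:
    """Toy in-memory file store (hypothetical, not Tachyon's API).

    Files live in memory. For each file we also record its lineage:
    the function and input files that produced it. A file lost from
    memory is rebuilt by recomputation instead of being read from a replica.
    """

    def __init__(self) -> None:
        self._memory: Dict[str, bytes] = {}
        self._lineage: Dict[str, Tuple[Callable[..., bytes], Tuple[str, ...]]] = {}
        self.recomputations = 0  # counts how often lineage was replayed

    def put_source(self, name: str, data: bytes) -> None:
        # Assumption: source data is always recoverable; a constant
        # function stands in for durable underlying storage.
        self._lineage[name] = (lambda: data, ())
        self._memory[name] = data

    def put_derived(self, name: str, fn: Callable[..., bytes],
                    inputs: Iterable[str]) -> None:
        # Record how the file is produced, then materialize it in memory.
        inputs = tuple(inputs)
        self._lineage[name] = (fn, inputs)
        self._memory[name] = fn(*(self.read(i) for i in inputs))

    def evict(self, name: str) -> None:
        # Simulate losing the in-memory copy (eviction or node failure).
        self._memory.pop(name, None)

    def read(self, name: str) -> bytes:
        if name in self._memory:
            return self._memory[name]  # fast path: served from memory
        fn, inputs = self._lineage[name]
        self.recomputations += 1
        data = fn(*(self.read(i) for i in inputs))  # replay lineage recursively
        self._memory[name] = data  # cache again for later readers
        return data
```

In this sketch, a second job reading the same file hits the in-memory copy, and a lost file costs one recomputation rather than a permanent failure.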
Tachyon is Hadoop-compatible: existing Spark and MapReduce programs can run on top of it without any code changes. The project is open source and is deployed at multiple companies. It has more than 40 contributors from over 10 institutions, including Yahoo, Intel, Red Hat, and Alibaba. The project is also part of the Fedora distribution.
Haoyuan Li is a Computer Science Ph.D. candidate in the AMPLab at UC Berkeley, working with Prof. Scott Shenker and Prof. Ion Stoica on big data and cloud computing. He leads Tachyon, an open source, memory-centric distributed file system that enables reliable file sharing at memory speed across cluster frameworks. Before Berkeley, he worked at Conviva and Google, and studied at Cornell University and Peking University.