AMP Update: Tachyon and Shark


Details
This meetup will present updates on two projects in the Spark stack: Tachyon and Shark. Tachyon is a new system, currently in a developer preview release, that allows memory-speed data sharing between multiple instances of Spark or of other parallel computing frameworks. It will simplify data sharing and isolation in Spark clusters, and speed up other tools that use HDFS as well. On the Shark side, we will discuss soon-to-be-released work that lets Shark use Tachyon efficiently, as well as other upcoming improvements.
This meetup will be hosted on the Google campus by ClearStory Data and their investor Google Ventures. We thank both ClearStory and Google for offering to host the event! Doors will open at 6:30, with talks starting at 7. Dinner will be provided.
Tachyon: Reliable File Sharing at Memory-Speed Across Cluster Frameworks (presented by Haoyuan Li and Ali Ghodsi)
Tachyon (http://tachyon-project.org/) is a distributed file system that enables reliable data sharing at memory speed across cluster frameworks such as Spark and Hadoop. It uses memory aggressively to achieve high throughput. Furthermore, by leveraging lineage information, Tachyon can avoid replication altogether, allowing it to remain fault-tolerant while retaining high throughput. In this talk, we will present Tachyon's developer preview release, which caches working-set files in memory and lets different jobs, queries, and frameworks access the cached files at memory speed. This allows different Spark jobs to share in-memory data across JVMs.
Shark Update and Upcoming Changes (presented by Reynold Xin)
Shark has seen several key changes in the past few months. One of the major ones is a new storage format to support efficiently reading data from Tachyon, which enables data sharing and isolation across instances of Shark. In addition, we're making several optimizations to both Shark and Spark that promise significant performance boosts.
