Speakers: Mayuresh Kunjir (http://www.cs.duke.edu/people/graduate/?csid=0004030) and Harold Lim (http://www.cs.duke.edu/people/graduate/?csid=0002030), Duke University
Spark is an open-source cluster-computing system developed by the AMPLab at the University of California, Berkeley. Spark offers fast performance and ease of development for a variety of data analytics workloads, such as machine learning, graph processing, and SQL-like queries. It supports distributed in-memory computations that can be up to 100x faster than Hadoop MapReduce.
Shark is a Hive-compatible data warehousing system built on Spark. Shark supports the HiveQL query language, the Hive Metastore, and all the serialization formats supported by Hive. By running on Spark and applying a number of built-in optimizations, Shark can perform up to 100x faster than Hive.
This talk will discuss the internals of Spark and Shark and the applications that these systems support, and will include a demo with performance comparisons against Hive.