Spark is an open-source cluster computing framework that can outperform Hadoop by up to 30x by keeping datasets in memory across jobs. Shark is a port of Apache Hive onto Spark that provides a similar speedup for SQL queries, allowing interactive exploration of data in existing Hive warehouses. In this meetup, we'll go into detail on the implementation of Shark and show how to get started with the first alpha release.
The meetup will be hosted at Palantir Technologies in Palo Alto. Food will be available at 6:30 PM, with talks starting at 7 PM.
More Details on Shark
We have ported Apache Hive, the large-scale Hadoop data warehouse solution, to run queries on Spark. The resulting system, Shark (Hive on Spark), can answer Hive QL queries up to 30 times faster than Hive, without any modification to the existing data. It is backward-compatible with the Hive QL language, the Hive metastore, and Hive user-defined functions (UDFs). We will cover the architecture and implementation of Shark, including our additions to Hive QL that let users cache data in memory, and a new column-oriented format we have designed for storing Hive data efficiently in memory on the JVM as arrays of primitive types.
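To give a rough feel for why storing columns as arrays of primitive types can help on the JVM, here is a minimal sketch in Java. It is not Shark's actual format: the class names (ColumnarSketch, Row, ColumnarTable), the two example columns, and the filter-and-sum query are all hypothetical and chosen only for illustration. The idea it shows is that a row-oriented layout keeps one object per row with boxed fields, while a column-oriented layout keeps one primitive array per column, so a scan over a single column touches a compact, contiguous array.

```java
// Illustrative sketch only: hypothetical names, not Shark's real storage classes.
final class ColumnarSketch {

    // Row-oriented layout: one heap object per row, with boxed field values.
    static final class Row {
        final Integer userId;
        final Double amount;

        Row(Integer userId, Double amount) {
            this.userId = userId;
            this.amount = amount;
        }
    }

    // Column-oriented layout: one primitive array per column, no per-row objects.
    static final class ColumnarTable {
        final int[] userId;
        final double[] amount;

        ColumnarTable(int[] userId, double[] amount) {
            if (userId.length != amount.length) {
                throw new IllegalArgumentException("columns must have equal length");
            }
            this.userId = userId;
            this.amount = amount;
        }

        // Convert a batch of rows into primitive column arrays.
        static ColumnarTable fromRows(Row[] rows) {
            int[] ids = new int[rows.length];
            double[] amounts = new double[rows.length];
            for (int i = 0; i < rows.length; i++) {
                ids[i] = rows[i].userId;     // unboxed once, at load time
                amounts[i] = rows[i].amount;
            }
            return new ColumnarTable(ids, amounts);
        }

        // Rough equivalent of: SELECT SUM(amount) WHERE userId >= minUserId
        double sumAmountWhereUserIdAtLeast(int minUserId) {
            double total = 0.0;
            for (int i = 0; i < userId.length; i++) {
                if (userId[i] >= minUserId) {
                    total += amount[i];
                }
            }
            return total;
        }
    }

    public static void main(String[] args) {
        Row[] rows = {
            new Row(1, 10.0),
            new Row(2, 20.5),
            new Row(3, 30.0),
        };
        ColumnarTable table = ColumnarTable.fromRows(rows);
        System.out.println(table.sumAmountWhereUserIdAtLeast(2));
    }
}
```

In a layout like this, the number of heap objects depends on the number of columns rather than the number of rows, which generally reduces per-object memory overhead and garbage collection work for large tables.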
Additionally, we will discuss our ongoing work on integrating SQL processing with machine learning, which we see as a natural future direction for Shark because Spark is inherently efficient at running iterative algorithms. In Shark, we allow users to express their machine learning algorithms as Scala-based "distributed UDFs", which then run in the same execution engine as the SQL query processor. This enables much more efficient data pipelines and provides a unified system for data analysis using both SQL and sophisticated statistical learning functions.
These topics will be presented by Reynold Xin, Cliff Engle and Antonio Lupher, the Berkeley research team behind Shark.