Spark User Meetup featuring Hadoop, Mesos, Debugging


Details
The Spark User Meetup is a chance to interact with developers and users of Spark (http://www.spark-project.org), the Scala-based high-speed cluster computing framework, and talk about big data processing in general. This third meetup will be held on the UC Berkeley campus at 7:30 PM, with pizza available starting at 7 PM. There will be two talks:
Running Spark and Hadoop on a Private Cluster with Mesos
(Benjamin Hindman, UC Berkeley and Twitter)
This talk will cover how to deploy Spark to a cluster using the Apache Mesos (http://incubator.apache.org/mesos) cluster manager, and how to dynamically share resources with Hadoop MapReduce by running Hadoop through Mesos as well. It will focus on the upcoming 0.9 release of Mesos, which provides a variety of usability and fault-tolerance fixes. We will demo how to set up and configure a cluster with Mesos, Spark, Hadoop MapReduce, and HDFS, starting from plain Linux machines. In addition, we'll cover practical issues such as how to find log files and debug your jobs.
Arthur: The Spark Debugger
(Ankur Dave, UC Berkeley)
Debugging large parallel jobs is hard: the sheer scale of the computation makes it difficult to track what's happening, inevitable weirdnesses in the data trigger errors, and it's difficult to tell whether a program is performing efficiently. To tackle this problem, we are designing Arthur, a debugger for Spark programs that provides visibility into the computation and powerful analysis features. One key feature of Arthur is that it can leverage the deterministic nature of Spark programs to efficiently replay part of a parallel job. Using this capability, users can rerun any task in the job in a single-process debugger to step through it line by line, or rebuild any intermediate dataset in the job and query it interactively from the Spark shell. We are also using replay to build tracing capabilities, such as figuring out which input records caused an output record. This talk will give an overview of the research going on in Arthur and cover several features that are already included in Spark. We also welcome your suggestions for improving debugging!
We'll meet in the Banatao Auditorium (room 310) of Sutardja Dai Hall on the Berkeley campus (http://g.co/maps/wr2qy). It's about a 15-minute walk from the Downtown Berkeley BART station.
