Deep Dive into Spark Tuning


Details
In this meetup, we will focus on tuning Spark across various workloads.
Seats are limited, so please RSVP only if you are certain you can attend.
Agenda:
10:00 AM - 11:15 AM - Understanding the scalability limits of Spark applications, by Rohit Karlupia from Qubole.
The usual approach to Spark application tuning is simple trial and error. This takes time, sometimes lots of it, while wasting compute cycles. Moreover, it doesn't tell us where to look for further improvements, or even whether we have reached the performance limits of a given application. This talk introduces a standard methodology for tuning Spark applications. Spark applications can be slow for various reasons, the most common being a badly written or misconfigured application. Another class of causes includes hardware constraints, the configuration of auxiliary services, and even nuances of cloud providers.
Given a single run of an application, we will try to answer:
1) Will this application run faster with more cores? How much faster?
2) Can we save compute cost by running it with fewer cores, without much increase in wall-clock time?
3) What is the absolute minimum time this application will take, even given infinite computing power?
4) How far is the given application from an "ideal Spark application"? What is the best runtime we could get if we made this application an "ideal Spark application"?
We will go into the theory behind answering these questions, the assumptions and limits of that theory, and where to look for further improvements; a toy version of the idea is sketched below.
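As context for the talk (and purely as an illustration, not the presenter's actual methodology), here is one back-of-the-envelope way to approach questions 1-3 from a single run: treat stages as sequential, and bound each stage's runtime below by both its total task work divided by the core count and its longest single task. The stage structure and task durations below are made-up values standing in for what you would extract from a Spark event log.

    # Back-of-the-envelope scalability model for a single Spark run.
    # NOTE: an illustrative sketch, not the methodology from the talk.
    # It assumes per-stage task durations (seconds) were already pulled
    # from the Spark event log; the numbers below are hypothetical.

    stage_task_durations = {
        0: [12.0, 11.5, 13.2, 12.8],   # hypothetical stage 0 task times
        1: [30.1, 29.7, 31.0],         # hypothetical stage 1 task times
    }

    def estimated_runtime(cores: int) -> float:
        """Lower-bound wall-clock estimate: stages run sequentially; within
        a stage, runtime is at least total work / cores and at least the
        longest single task (one task cannot be split across cores)."""
        total = 0.0
        for tasks in stage_task_durations.values():
            total += max(sum(tasks) / cores, max(tasks))
        return total

    # Q1/Q2: how does estimated runtime trade off against core count?
    for cores in (2, 4, 8, 16):
        print(f"{cores:>2} cores -> at least {estimated_runtime(cores):.1f}s")

    # Q3: the floor with infinite cores is the sum, over stages, of each
    # stage's longest task.
    floor = sum(max(tasks) for tasks in stage_task_durations.values())
    print(f"infinite cores -> at least {floor:.1f}s")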
11:30 AM - 12:45 PM - Dr. Elephant: Achieving Quicker, Easier, and Cost-Effective Analytics in Spark, by Akshay Rai from LinkedIn.
Is your job running slower than usual? Do you want to make sense of thousands of Hadoop and Spark metrics? Do you want to monitor the performance of your flows, get alerts, and auto-tune them? These are common questions every Hadoop user asks, yet no single solution addresses them. At LinkedIn we faced many such issues and built a simple self-serve tool for Hadoop users called Dr. Elephant. Dr. Elephant, which is already open sourced, is a performance monitoring and tuning tool for Hadoop and Spark. It improves developer productivity and cluster efficiency by making it easier to tune jobs. Since being open sourced, it has been adopted by multiple organizations and followed with great interest in the Hadoop and Spark community. In this talk, we will discuss Dr. Elephant and outline our efforts to expand it into a comprehensive monitoring, debugging, and tuning tool for Hadoop and Spark applications. We will cover how Dr. Elephant performs exception analysis, gives clear and specific tuning suggestions, tracks metrics, and monitors their historical trends.
Open Source: https://github.com/linkedin/dr-elephant
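To give a flavor of what such tuning heuristics look like, here is a simplified sketch (not Dr. Elephant's actual code; the real heuristics live in the repo linked above) that flags executors whose JVM garbage-collection time eats too large a share of run time. The thresholds are illustrative, not Dr. Elephant's defaults.

    # Simplified illustration of a Dr. Elephant-style heuristic: bucket an
    # executor's GC overhead into a severity level. Thresholds are made up
    # for this sketch; see the linked repo for the real heuristics.

    def gc_severity(gc_time_ms: float, run_time_ms: float) -> str:
        """Map the GC-time-to-run-time ratio to a coarse severity bucket."""
        ratio = gc_time_ms / run_time_ms
        if ratio > 0.25:
            return "CRITICAL"
        if ratio > 0.10:
            return "SEVERE"
        if ratio > 0.05:
            return "MODERATE"
        return "NONE"

    # Example: an executor that spent 1.2s of a 10s run in GC.
    print(gc_severity(1_200, 10_000))  # -> "SEVERE"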
Kindly bring a photo ID for the security check.
For any queries, reach out to:
Gaurav - 9535322220
Shashank - 9036731090
Madhukara - 9686409878
