Toronto Apache Spark #17


Details
Debugging Spark: A big data monster identification guide and where to find them
Agenda:
6:30PM to 7:00PM - Opening and networking (Refreshments provided)
7:00PM to 8:30PM - Debugging Spark: A big data monster identification guide and where to find them by Holden Karau
8:30PM to 9:00PM - Networking
------------------------------------------------------------
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden will explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. We will demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but we will examine how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. In addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden will cover how to quickly use the UI to figure out whether certain types of issues are occurring in a job.
Target audience:
Data Scientists, Data Engineers, DevOps, and anyone who gets stuck debugging a Spark job.
Level: Intermediate - Technical
Speaker: Holden Karau (http://linkedin.com/in/holdenkarau)
Principal Software Engineer at IBM's Spark Technology Center
Holden Karau is a transgender Canadian and an active open source contributor. When not in San Francisco working as a software development engineer at IBM's Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. Holden is a co-author of numerous books on Spark, including High Performance Spark (which she believes is the gift of the season for those with expense accounts) and Learning Spark. Holden is a Spark committer, specializing in PySpark and Machine Learning. Prior to IBM she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software she enjoys playing with fire, welding, scooters, poutine, and dancing.
------------------------------------------------------------
Sponsors:
http://photos1.meetupstatic.com/photos/event/c/9/1/9/600_458391481.jpeg
Special Thanks to:
Ashwin Tumne (https://www.linkedin.com/in/tumne)
