Next Meetup

Bay Area Apache Spark & Women in Big Data @ Databricks HQ, SF
Please join us for an evening of Bay Area Apache Spark and WiBD Meetup tech talks from women in engineering, hosted and moderated by Maddie Schults from Databricks. Thanks to Databricks for hosting and sponsoring this meetup.

Agenda:
6:00 - 6:30 pm Mingling & Refreshments
6:30 - 6:40 pm Welcome: opening remarks, announcements, acknowledgments, and introductions
6:40 - 7:15 pm Holden Karau: Bringing a Jewel (as a starter) from the Python world to the JVM with Apache Spark, Arrow, and Spacy
7:15 - 7:50 pm Anya Bida: Tech Talk 2
7:50 - 8:25 pm Edwina Lu: Metrics-Driven Tuning of Apache Spark at Scale
8:25 - 8:45 pm More Mingling & Networking

Tech-Talk 1: Bringing a Jewel (as a starter) from the Python world to the JVM with Apache Spark, Arrow, and Spacy

Abstract: With the new Apache Arrow integration in PySpark 2.3, it is now starting to become reasonable to look to the Python world and ask “what else do we want to steal besides TensorFlow”, or as a Python developer look and say “how can I get my code into production without it being rewritten into a mess of Java?” Regardless of your specific side(s) in the JVM/Python divide, collaboration is getting a lot faster, so let's learn how to share! In this brief talk we will examine sharing some of the wonders of Spacy with the Java world, which still has a somewhat lackluster set of options for NLP.

Bio: Holden Karau

Tech-Talk 2: Details Coming Soon

Bio: Anya Bida

Tech-Talk 3: Metrics-Driven Tuning of Apache Spark at Scale

Abstract: Tuning Apache Spark can be complex and difficult, since there are many different configuration parameters and metrics. As the Spark applications running on LinkedIn’s clusters become more diverse and numerous, it is no longer feasible for a small team of Spark experts to help individual users debug and tune their Spark applications. Users need to be able to get advice quickly and iterate on their development, and any problems need to be caught promptly to keep the cluster healthy.

To achieve this, we automated the process of identifying performance issues and providing custom tuning advice to users, and made improvements for scaling to handle thousands of Spark applications per day. One challenge we encountered was a lack of proper metrics related to Spark application performance. We will present new metrics added to Spark which can precisely report resource usage during runtime, and discuss how these are used in heuristics to identify problems. Based on this analysis, custom recommendations are provided to help users tune their applications. We will also show the impact of these tuning recommendations, including improvements in application performance and in overall cluster utilization.

Bio: Edwina Lu is a software engineer on LinkedIn's Hadoop infrastructure development team, currently focused on supporting Spark on the company's clusters. Previously, she worked at Oracle on database replication. Edwina holds a Master's degree in Computer Science from Stanford University.

Databricks, Inc HQ

160 Spear St, Floor 13 · San Francisco, CA

    What we're about

    Public Group

    This is a meetup for Bay Area users of Spark, the high-speed cluster programming framework. We rotate among locations in San Francisco and Silicon Valley. We also discuss other Spark-related projects, including Spark SQL, MLlib, GraphX, and Spark Streaming. The meetup includes introductions to the various Spark features, case studies from users, best practices for deployment and tuning, and updates on development.

    Members (9,709)
