• Bay Area Apache Spark Meetup @ Unravel Data in SF

    Microsoft Reactor

    Let's kick off the New Year 2019 with our first BASM Meetup! Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark at scale from Unravel Data and Databricks.

    Agenda:
    6:30 - 7:00 pm: Social Hour with Food, Drinks, Beer & Wine
    7:00 - 7:05 pm: Jules Introduction & Announcements
    7:05 - 7:50 pm: Tech Talk from Unravel Data
    8:00 - 8:45 pm: Tech Talk from Databricks
    8:45 - 9:00 pm: Unravel Raffle and Additional Networking, Q&A

    Tech Talk 1: Putting AI to Work on Apache Spark
    Presenter: Shivnath Babu
    Abstract: Apache Spark simplifies AI, but why not use AI to simplify Spark performance and operations management? An AI-driven approach can drastically reduce the time Spark application developers and operations teams spend troubleshooting problems. This talk will discuss algorithms that run real-time streaming pipelines as well as build ML models in batch to enable Spark users to automatically solve problems like: (i) fixing a failed Spark application, (ii) auto-tuning SLA-bound Spark streaming pipelines, (iii) identifying the best broadcast joins and caching for Spark SQL queries and tables, (iv) picking cost-effective machine types and container sizes to run Spark workloads on AWS, Azure, and Google Cloud; and more.
    Bio: Shivnath Babu is CTO and Co-Founder at Unravel Data Systems and an adjunct professor of computer science at Duke University. Shivnath co-founded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

    Tech Talk 2: Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark
    Presenter: Lu Wang
    Abstract: Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data. Project Hydrogen is a major Apache Spark initiative to bring the best AI and big data solutions together. It introduced barrier execution mode in the Spark 2.4.0 release to help distributed model training, and it explores optimized data exchange to accelerate distributed model inference. In this talk, we will explain why barrier execution mode is needed, how it works, and how to use it to integrate distributed DL training on Spark. We will demonstrate HorovodRunner, the first Spark+AI integration powered by Project Hydrogen. It is based on the Horovod framework developed by Uber and on Databricks Runtime 5.0 for Machine Learning. We will also share our experience and performance tips on how to combine Pandas UDFs from Spark with AI frameworks to scale complex model inference workloads.
    Bio: Lu Wang is a software engineer at Databricks. His main research interests include developing high-performance parallel algorithms for scientific computing and applications. He has been actively involved in the development of Project Hydrogen, Spark Deep Learning Pipelines, and Spark MLlib since he joined Databricks. Before Databricks, he worked on parallel multigrid linear solvers for reservoir-simulation linear systems on exascale parallel machines at Lawrence Livermore National Laboratory. He received his Ph.D. from the Pennsylvania State University in 2014.
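    The broadcast joins mentioned in Tech Talk 1 are easiest to see in miniature. The sketch below is a hypothetical, pure-Python illustration — not Unravel's algorithm or Spark's implementation — of the core idea: copy the small table to every worker as a hash map so the large table never has to be shuffled.

```python
# Pure-Python sketch of the broadcast-join idea (illustrative only): the
# small side is materialized as a hash map, the "broadcast" copy, and the
# large side is streamed past it with no shuffle.

def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on `key`, hashing the small side."""
    lookup = {row[key]: row for row in small_rows}  # the "broadcast" copy
    joined = []
    for row in large_rows:
        match = lookup.get(row[key])
        if match is not None:
            merged = dict(row)
            merged.update(match)
            joined.append(merged)
    return joined

orders = [{"user": 1, "amount": 30}, {"user": 2, "amount": 5}]
users = [{"user": 1, "name": "ada"}]
print(broadcast_hash_join(orders, users, "user"))
# → [{'user': 1, 'amount': 30, 'name': 'ada'}]
```

    In real Spark SQL the same effect is requested with the `broadcast()` hint from `pyspark.sql.functions` inside a join.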

  • Bay Area Apache Spark Meetup @ Adobe in San Jose, CA

    Adobe Systems Inc.

    Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark from Adobe (https://www.adobe.com/) and an Apache Spark Committer from Databricks (https://databricks.com).

    Agenda:
    6:00 - 6:30 pm Mingling & Refreshments
    6:30 - 6:40 pm Welcome opening remarks, announcements, acknowledgments, and introductions (Jules Damji + Adobe)
    6:40 - 7:20 pm Apache Spark at Adobe
    7:20 - 8:00 pm Upcoming Apache Spark 2.4: What’s New & Why Should You Care
    8:00 - 8:30 pm More Mingling & Networking

    Tech-Talk 1: Apache Spark at Adobe
    Abstract: The Adobe Cloud Platform is a multi-tenant big data stack as a service on the cloud that provides the modern foundation for the various parts of the Adobe Experience Cloud. From a data processing perspective, Adobe uses Apache Spark in a variety of scenarios. We will talk about the high-level data architecture, briefly touching on the infrastructure, scale, and challenges, and lastly we will cover how we are leveraging Spark. As part of the Cloud Platform, we have also built a Query Engine leveraging Spark SQL for ad-hoc data querying. The Query Engine implements the PostgreSQL protocol and leverages Akka Streams and the Presto Parser as an abstraction layer around Spark SQL. We will talk about the high-level architecture and about the various patches made to Spark SQL, such as support for nested column pruning, that are critical to our performance needs when accessing data with thousands of nested columns.
    Bio: Yogesh Natarajan is a senior software engineer in the Data Platform group at Adobe. His interests include building server-side web applications and big data technologies. He has previously worked at Chegg and Yahoo, and graduated with a master's from UC Irvine. Andrew is a senior software engineer in the Data Platform group at Adobe. He specializes in building modern, scalable, cloud-based Scala applications.

    Tech-Talk 2: Upcoming Apache Spark 2.4: What’s New & Why Should You Care
    Abstract: The upcoming Apache Spark 2.4 release is the fifth release in the 2.x series. This talk will provide an overview of the major features and enhancements in this upcoming release:
    * A new scheduling model (Barrier Scheduling) to enable users to properly embed distributed deep learning training as a Spark stage, simplifying the distributed training workflow.
    * 35 higher-order functions added for manipulating arrays/maps in Spark SQL.
    * A new native Avro data source, based on Databricks' spark-avro module.
    * Eager evaluation mode in PySpark on all operations, for teaching and debuggability.
    * PySpark, R, and client-mode support for Spark on Kubernetes.
    * Various enhancements in Structured Streaming, e.g., stateful operators in continuous processing.
    * Various performance improvements in built-in data sources, e.g., Parquet nested schema pruning.
    Bio: Xiao Li is a software engineer at Databricks. His main interests are in Spark SQL, data replication, and data integration. Previously, he was an IBM Master Inventor and an expert on asynchronous database replication. He received his Ph.D. from the University of Florida in 2011. He is a Spark committer/PMC member.

    PARKING: All visitors attending the Adobe/Spark Meetup in the ET 01 Park conference room will need to park in the East Tower basement level 1.
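    To give a flavor of the higher-order functions mentioned in the 2.4 overview, here is a hypothetical illustration: the Spark SQL syntax is shown only as a query string (it assumes a live SparkSession named `spark`), alongside plain-Python equivalents of the same array semantics.

```python
# Spark 2.4 higher-order SQL functions transform() and filter(), shown as
# the SQL you would run, next to stdlib-only equivalents of their semantics.

query = ("SELECT transform(xs, x -> x + 1) AS bumped, "
         "filter(xs, x -> x > 1) AS big "
         "FROM (SELECT array(1, 2, 3) AS xs)")
# spark.sql(query).show()  # requires a running SparkSession (assumption)

def transform(xs, f):
    """Like SQL transform(xs, x -> f(x)): apply f to every element."""
    return [f(x) for x in xs]

def array_filter(xs, p):
    """Like SQL filter(xs, x -> p(x)): keep elements where p holds."""
    return [x for x in xs if p(x)]

print(transform([1, 2, 3], lambda x: x + 1))     # → [2, 3, 4]
print(array_filter([1, 2, 3], lambda x: x > 1))  # → [2, 3]
```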

  • Bay Area Apache Spark Meetup @ Databricks, HQ in San Francisco

    Join us for an evening of Bay Area Apache Spark Meetup featuring open-source tech-talks about using and innovating with Apache Spark from Databricks (https://databricks.com). Thanks to Databricks for hosting and sponsoring this meetup.

    Agenda:
    6:00 - 6:30 pm Mingling & Refreshments
    6:30 - 6:40 pm Welcome opening remarks, announcements & introductions (Jules S. Damji + Reynold Xin)
    6:40 - 7:25 pm Tech-Talk 1: Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
    7:25 - 8:10 pm Tech-Talk 2: MLflow: Infrastructure for a Complete Machine Learning Life Cycle
    8:10 - 8:30 pm Mingling & Networking

    Tech-Talk 1: Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
    Abstract: Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data and Apache Spark. Increasingly, Spark users want to integrate Spark with distributed deep learning and machine learning frameworks built for state-of-the-art training. On the other side, DL/AI users increasingly want to handle the large and complex data scenarios needed for their production pipelines. This talk introduces a new project that substantially improves the performance and fault-recovery of distributed deep learning and machine learning frameworks on Spark. We will introduce the major directions and provide progress updates, including 1) barrier execution mode for distributed DL training, 2) fast data exchange between Spark and DL frameworks, and 3) accelerator-aware scheduling.
    Bio: Xiangrui Meng is an Apache Spark PMC member and a software engineer at Databricks. His main interests center around developing and implementing scalable algorithms for scientific applications. He has been actively involved in the development and maintenance of Spark MLlib since he joined Databricks. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His Ph.D. work at Stanford was on randomized algorithms for large-scale linear regression problems.

    Tech-Talk 2: MLflow: Infrastructure for a Complete Machine Learning Life Cycle
    Abstract: ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools, and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure. In this talk, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
    Bio: Members from the MLflow Team (https://www.mlflow.org/)
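    The tracking abstraction the MLflow talk describes can be sketched in a few lines. The class below is a deliberately tiny, hypothetical stand-in, not MLflow's API: it only shows why logging each run's parameters and metrics makes results comparable and reproducible (the real entry points are `mlflow.start_run()`, `mlflow.log_param()`, and `mlflow.log_metric()`).

```python
# Minimal stand-in for run tracking (illustrative, not the mlflow package):
# each run records its parameters and metrics so the best one can be found
# later instead of being lost in a notebook.

class Run:
    def __init__(self):
        self.params, self.metrics = {}, {}

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value):
        self.metrics[key] = value

runs = []
for alpha in (0.1, 0.5):                 # try multiple parameter settings
    run = Run()
    run.log_param("alpha", alpha)
    run.log_metric("rmse", 1.0 - alpha)  # stand-in for a real evaluation
    runs.append(run)

best = min(runs, key=lambda r: r.metrics["rmse"])
print(best.params)  # → {'alpha': 0.5}
```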

  • Spark + AI Summit: Bay Area Apache Spark Meetup @ Moscone Center, SF

    Moscone Center, San Francisco, CA

    Moscone Center Room 2014

    Join us for an evening of Bay Area Apache Spark Meetup at the Spark + AI Summit (https://databricks.com/sparkaisummit/north-america) featuring tech-talks from Databricks (https://databricks.com/), Uber (https://www.uber.com/), and Stanford University (https://www.stanford.edu/). Thanks to Databricks for hosting and sponsoring this meetup. (Note: This meetup is open to everyone. You don’t have to be registered for Spark + AI Summit.)

    Agenda:
    6:00 - 6:30 pm Mingling & Refreshments
    6:30 - 6:40 pm Opening Remarks & Introductions, Jules Damji, Databricks
    6:40 - 7:20 pm Tech Talk 1: Richard Garris, Databricks
    7:20 - 8:00 pm Tech Talk 2: Alexander Sergeev, Uber
    8:00 - 8:05 pm Short Break
    8:05 - 8:45 pm Tech Talk 3: Peter Kraft, Stanford University
    8:45 - 9:00 pm More Mingling & Networking

    Tech-Talk 1: Understanding Parallelization of Machine Learning Algorithms in Apache Spark™
    Abstract: Machine Learning (ML) is a subset of Artificial Intelligence (AI). In this talk, Richard Garris, Principal Architect at Databricks, will explain how various ML algorithms are parallelized in Apache Spark. Andrew Ng calls the algorithms the "rocket ship" and the data "the fuel that you feed machine learning" to build deep learning applications. We will start with an understanding of machine learning pipelines built using single-machine tools, including Pandas, scikit-learn, and R. Then we will discuss how Apache Spark MLlib can be used to parallelize your machine learning pipeline with Linear Regression and Random Forest. Lastly, we will discuss ways to parallelize single-machine algorithms in Spark by broadcasting the data and then performing distributed feature selection, model creation, or hyperparameter tuning.
    Bio: Richard Garris is a Principal Solutions Architect at Databricks focused on helping clients with their Advanced Analytics initiatives using Apache Spark and MLlib. He has spent 13 years working with enterprises in data management and analytics. Richard got his undergraduate degree at The Ohio State University and a Masters in Software Management from CMU. His previous work experience includes Skytree, Google, and PwC.

    Tech-Talk 2: Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow
    Abstract: Horovod makes it easy to train a single-GPU TensorFlow model on many GPUs - both on a single server and across multiple servers. The talk will touch upon the mechanics of deep learning training, the challenges that distributed deep learning poses, the mechanics of Horovod, and the practical steps necessary to train a deep learning model on your favorite cluster.
    Bio: Alex Sergeev is a Deep Learning Infrastructure Engineer at Uber working on scalable deep learning. He received his M.S. degree in Computer Science from the Moscow Engineering Physics Institute. Before joining Uber, he was a Senior Software Engineer at Microsoft working on Big Data Mining.

    Tech-Talk 3: Apache Spark™ and MacroBase
    Abstract: In this talk, we present MacroBase, an analytics system we have built at Stanford University that uses Apache Spark to prioritize human attention via large-scale feature selection. In a world swamped with enormous datasets and an enormous variety of complex tools to analyze them, MacroBase specializes in one task: finding and explaining unusual or interesting trends in data as easily as possible. Specifically, it searches for correlations in large-scale datasets. For example, an app developer wondering why their app was crashing could ask MacroBase to find factors in their logs that correlate with crash behavior and explain the crashes. Alternatively, an analyst looking for trends in time series data could ask MacroBase to find changes over time. MacroBase relies on Spark and Spark SQL to provide fast and easy-to-use analytics. Users operate MacroBase using MacroBase-SQL, an extension of SQL that introduces new operators to partition datasets and find explanations on partitions. MacroBase-SQL is built on top of Spark SQL, with its new operators taking in and returning Spark DataFrames. This means that MacroBase is fully distributed and can easily be integrated into any system already running Spark or Spark SQL. In this talk, we will explain how we built MacroBase's new operators on top of Spark and what you can do with them.
    Bio: Peter Kraft is a first-year graduate student at Stanford advised by Peter Bailis and Matei Zaharia. He is interested in solving problems at the intersection of systems and machine learning and in building more usable and powerful machine learning systems.
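    The "broadcast the data, parallelize the search" pattern from the first talk can be rendered schematically. The toy below is a hypothetical pure-Python sketch over a made-up least-squares problem, not Databricks code; in PySpark the same shape would use `sc.broadcast(data)` plus `sc.parallelize(grid).map(...)`.

```python
# Each "map task" trains/scores one hyperparameter candidate against the
# full (broadcast) dataset; a "reduce" picks the winner.

data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]  # (x, y) pairs with y = 2x

def loss(slope, points):
    """Sum of squared errors of the line y = slope * x."""
    return sum((slope * x - y) ** 2 for x, y in points)

grid = [0.5, 1.0, 2.0, 3.0]                              # candidate slopes
scores = [(s, loss(s, data)) for s in grid]              # map step
best_slope, best_loss = min(scores, key=lambda t: t[1])  # reduce step
print(best_slope, best_loss)  # → 2.0 0.0
```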

  • BASM @ Bloomberg in San Francisco

    Bloomberg

    Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about using Apache Spark from Bloomberg (https://www.bloomberg.com/) and Databricks (https://databricks.com/). Thanks to Bloomberg for hosting and sponsoring this meetup. Bloomberg building security requires that you fill out this form if you RSVP: https://goo.gl/forms/wjeDeg6HPLAeeXIC3

    Agenda:
    6:00 - 6:30 pm Mingling & Refreshments
    6:30 - 6:35 pm Welcome opening remarks, announcements, acknowledgments, and introductions
    6:35 - 7:15 pm Bloomberg, Ilan Filonenko: Apache Spark on K8s and HDFS Security
    7:20 - 8:00 pm Databricks, Jules S. Damji: What’s New in Apache Spark 2.3 and Why Should You Care
    8:00 - 8:30 pm Mingling

    Tech-Talk 1: Apache Spark on K8s and HDFS Security
    Abstract: There is growing interest in running Apache Spark natively on Kubernetes. Ilan Filonenko will explain the design idioms, architecture, and internal mechanics of Spark orchestration over Kubernetes. Since data for Spark analytics is often stored in HDFS, Ilan will also explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as data locality and security through the use of Kubernetes constructs such as secrets and RBAC rules.
    Bio: Ilan Filonenko is a member of the Data Science Infrastructure team at Bloomberg, where he has designed and implemented distributed systems at both the application and infrastructure level. Previously, Ilan was an engineering consultant and technical lead at various startups and research divisions across multiple industry verticals, including medicine, hospitality, finance, and music. Ilan’s research has focused on algorithmic, software, and hardware techniques for high-performance machine learning, with a focus on optimizing stochastic algorithms such as stochastic gradient descent (SGD).

    Tech-Talk 2: What’s New in Apache Spark 2.3 and Why Should You Care
    Abstract: The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support. This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
    * Continuous Processing in Structured Streaming.
    * PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
    * Native Kubernetes support, marrying the best of container orchestration and distributed data processing.
    Bio: Jules S. Damji is an Apache Spark Community and Developer Advocate at Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies such as Sun Microsystems, Netscape, @Home, LoudCloud/Opsware, VeriSign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a B.Sc. and M.Sc. in Computer Science and an M.A. in Political Advocacy and Communication from Oregon State University, Cal State, and Johns Hopkins University, respectively.
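    The PySpark vectorization bullet above refers to Pandas UDFs (`pyspark.sql.functions.pandas_udf`, added in 2.3), which hand user code whole Arrow-backed batches instead of one row at a time. The stdlib-only sketch below is a hypothetical illustration of why that matters: each function call stands in for one JVM/Python boundary crossing.

```python
# Contrast a classic per-row UDF with a batched (Pandas-UDF-style) one by
# counting calls, i.e. simulated serialization boundary crossings.

calls = {"per_row": 0, "batched": 0}

def plus_one(x):
    calls["per_row"] += 1        # one "crossing" per row
    return x + 1

def plus_one_batch(batch):
    calls["batched"] += 1        # one "crossing" per batch
    return [x + 1 for x in batch]

rows = [1, 2, 3, 4]
per_row_result = [plus_one(x) for x in rows]
batched_result = plus_one_batch(rows)
print(per_row_result == batched_result, calls)
# → True {'per_row': 4, 'batched': 1}
```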

  • Bay Area Apache Spark & Women in Big Data @ Databricks HQ, SF

    Hosted and moderated by Maddie Schults (https://www.linkedin.com/in/maddieschults/) from Databricks (https://databricks.com/), please join us for an evening of Bay Area Apache Spark and WiBD (https://www.womeninbigdata.org/blog/) Meetup featuring tech-talks from women in engineering. Thanks to Databricks (https://databricks.com) for hosting and sponsoring this meetup.

    Agenda:
    6:00 - 6:30 pm Mingling & Refreshments
    6:30 - 6:40 pm Welcome opening remarks, announcements, acknowledgments, and introductions
    6:40 - 7:15 pm Holden Karau: Bringing a Jewel (as a starter) from the Python world to the JVM with Apache Spark, Arrow, and Spacy
    7:15 - 7:50 pm Anya Bida: Just Enough DevOps for Data Scientists (Part II)
    7:50 - 8:25 pm Shan He: Creating Beautiful and Meaningful Visualizations with Big Data
    8:25 - 8:45 pm More Mingling & Networking

    Tech-Talk 1: Bringing a Jewel (as a starter) from the Python world to the JVM with Apache Spark, Arrow, and Spacy
    Abstract: With the new Apache Arrow integration in PySpark 2.3, it is now starting to become reasonable to look to the Python world and ask “what else do we want to steal besides TensorFlow?”, or, as a Python developer, to ask “how can I get my code into production without it being rewritten into a mess of Java?” Regardless of your specific side(s) in the JVM/Python divide, collaboration is getting a lot faster, so let's learn how to share! In this brief talk we will examine sharing some of the wonders of Spacy with the Java world, which still has a somewhat lackluster set of options for NLP.
    Bio: Holden Karau (https://www.linkedin.com/in/holdenkarau/)

    Tech-Talk 2: Just Enough DevOps for Data Scientists (Part II)
    Abstract: Imagine we have Ada, our data science intern. Let's run through a very simple wordcount Spark job and find a handful of potential failure points. Dozens of failures can and will happen when running Spark jobs on commodity hardware. Given this basic foundation for infrastructure-level expectations, this talk gives Ada tools to ensure her job isn’t caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to focus on some more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infra failures gracefully. Note: this talk is a Spark-focused extension of Part I, "Just Enough DevOps for Data Scientists" from Scale by the Bay 2018: https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s
    Bio: Anya Bida (https://www.linkedin.com/in/anyabida/)

    Tech-Talk 3: Creating Beautiful and Meaningful Visualizations with Big Data
    Abstract: At Uber, location data is our biggest asset. How do we create data visualizations with rich location data, render a million points of events in the blink of an eye, and, most importantly, derive insights from them? In this presentation, you'll get a behind-the-scenes look at the tools and data visualizations we use at Uber to inform business decisions. I will walk us through an overview of the data visualization process with a case study, and discuss how and why we built our own visualization tool to visualize location data in a more meaningful way. I will also show that you can create beautiful visualizations, but in order for them to be useful, you have to understand the information you are designing for.
    Bio: Shan He (https://www.linkedin.com/in/shan-he-25400b16/)
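    The wordcount-with-failures example from Tech-Talk 2 can be miniaturized. The toy below is a hypothetical, stdlib-only sketch, not Anya's material: a wordcount step wrapped in the kind of retry logic the talk advocates, since individual tasks on commodity hardware can and will fail. (Spark itself retries failed tasks; the relevant setting is `spark.task.maxFailures`.)

```python
# A wordcount "task" plus a generic retry wrapper that re-runs the task up
# to max_failures times before surfacing the error.

from collections import Counter

def count_words(lines):
    return Counter(word for line in lines for word in line.split())

def run_with_retries(task, args, max_failures=4):
    for attempt in range(1, max_failures + 1):
        try:
            return task(args)
        except Exception:
            if attempt == max_failures:
                raise            # surface the error once retries run out

print(run_with_retries(count_words, ["to be or not to be"]))
# → Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```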

  • Bay Area Apache Spark Meetup @ Workday in San Mateo

    Happy New Year! Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks on Apache Spark (https://spark.apache.org/) at scale from Workday (https://www.workday.com/) and Databricks (https://databricks.com/).

    Agenda:
    6:30 - 7:00 pm Mingling & Refreshments
    7:00 - 7:10 pm Welcome opening remarks, announcements, acknowledgments, and introductions
    7:10 - 7:50 pm Workday Prism Analytics: Unifying Interactive and Batch Data Processing Using Apache Spark
    7:50 - 8:30 pm Upcoming Release of Apache Spark 2.3: What’s New?
    8:30 - 8:45 pm More Mingling & Networking

    Tech-Talk 1: Workday Prism Analytics: Unifying Interactive and Batch Data Processing Using Apache Spark
    Abstract: Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. To prepare data for analysis, business users can clean up and transform their datasets in an interactive, modern data prep environment. Workday Prism Analytics thus needs to run three types of scalable data processing applications: "always on" query engine and data prep applications, and on-demand batch execution of transformation pipelines. We standardized on Apache Spark and Spark SQL for all three applications due to Spark's scalability, as well as the flexibility and extensibility of the Catalyst compiler. All applications share much of the compilation and execution code, except for sampling, caching, and result extraction. In this talk, we will first introduce Workday Prism Analytics and describe its Spark-based interactive and batch data processing components. We will then describe the data prep transformations and their compilation into Spark DataFrames, through Spark SQL Catalyst plans, in both interactive and batch mode. We will focus on some challenges we encountered while compiling and executing complex pipelines and queries. For example, Spark SQL compilation times exceeded execution time for some low-latency queries, and compiled plans grew dangerously large for data prep pipelines with multiple self-joins and self-unions. We will describe the caching, sampling, and query compilation techniques that allow us to support an interactive user experience. Finally, we will conclude with an overview of the open challenges that we plan to tackle in the future.
    Bio: Dr. Andrey Balmin is a Sr. Principal Engineer at Workday, where he is building the self-service Prism Analytics platform. His work on the foundational technology for Prism began at Platfora (which was acquired by Workday). Prior to this, he was a Research Staff Member at IBM Almaden Research Center, where he focused on search and query processing of semi-structured and graph-structured data in Data Warehousing and, later, Big Data platforms. He holds a Ph.D. degree in computer science from UC San Diego.

    Tech-Talk 2: Upcoming Release of Apache Spark 2.3: What’s New
    Abstract: Apache Spark 2.0 set the architectural foundations of Structure in Spark, unified high-level APIs, Structured Streaming, and the underlying performant components like the Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in the Spark 2.1 and 2.2 releases. Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features:
    • Kubernetes Scheduler Backend
    • PySpark Performance and Enhancements
    • Continuous Structured Streaming Processing
    • DataSource v2 APIs
    • Structured Streaming v2 APIs
    Bio: Xiao Li is a software engineer at Databricks. His main interests are in Spark SQL, data replication, and data integration. Previously, he was an IBM Master Inventor and an expert on asynchronous database replication. He received his Ph.D. from the University of Florida in 2011. He is a Spark committer/PMC member.

    Parking Instructions: The parking lot is under construction, but visitors may enter via the Madison Ave.
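    The caching technique Tech-Talk 1 credits for its interactive latencies has a simple shape. The snippet below is a stdlib-only analogy, not Workday's code: pay the expensive compile/compute cost once and reuse the result across repeated interactive requests, much as Spark's `df.cache()` avoids recomputing a DataFrame shared by several queries.

```python
# Memoize an "expensive" plan compilation so repeated interactive requests
# for the same query reuse the first result.

from functools import lru_cache

compilations = []

@lru_cache(maxsize=None)
def compile_plan(query):
    compilations.append(query)   # stands in for expensive Catalyst work
    return "PLAN[" + query + "]"

for _ in range(3):               # the same interactive query, repeated
    plan = compile_plan("SELECT avg(x) FROM t")

print(plan, len(compilations))   # → PLAN[SELECT avg(x) FROM t] 1
```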

  • Women in Big Data Apache Spark Meetup @ Databricks

    Databricks Inc.

    Hosted and moderated by Maddie Schults (https://www.linkedin.com/in/maddieschults/) and Vida Ha (https://www.linkedin.com/in/vidaha/) from Databricks (https://databricks.com/), please join us for an evening of Bay Area Apache Spark Meetup featuring diversity and tech-talks from women educators and engineers in data science, computer science, and education. Thanks to Databricks (https://databricks.com) for hosting and sponsoring this meetup.

    Agenda:
    6:00 - 6:30 pm Mingling & Refreshments
    6:30 - 6:40 pm Welcome opening remarks, announcements, acknowledgments, and introductions
    6:40 - 7:15 pm Tech Talk 1 by Colleen Lewis
    7:15 - 7:50 pm Community Tech Talk 2 by Kay Ousterhout
    7:50 - 8:25 pm Databricks Tech Talk 3 by Sue Ann Hong
    8:25 - 8:45 pm More Mingling & Networking

    Tech-Talk 1: Fitting in CS when the stereotypes don't fit
    Abstract: The stereotypes of computer scientists just aren't flattering. Probably every computer scientist can think of dimensions of the stereotype that just don't fit. Why do these stereotypes of computer scientists matter? And how might we change them, and the tech industry more broadly? Learn about how Harvey Mudd College went about changing the culture of CS to go from a major with about 10% women in 2006 to over 50% in 2016.
    Bio: Dr. Colleen Lewis is an Assistant Professor of CS at Harvey Mudd College, where she has taught classes in both CS and social justice since 2012. Colleen frequently speaks about diversity and inclusion and has spoken at Amazon, Qualcomm, City National Bank, TurnItIn, Grace Hopper, SXSWedu, LA TechWeek, CalTech, USC, Rice, Northwestern, and the British Computing Society. She has conducted four workshops and given 33 invited talks focused on diversity and inclusion. Colleen is featured in the documentary Code: Debugging the Gender Gap. Colleen’s research is focused on how people learn CS and how people feel about learning CS. Half of her 20 peer-reviewed publications focus on diversity and inclusion within CS. At Grace Hopper in 2016, Colleen won the Denice Denton Emerging Leader Award for her work promoting diversity and inclusion. Colleen's research is funded by a $750k grant from the NSF.

    Tech-Talk 2: Apache Spark Performance: Past, Future, and Present
    Abstract: Apache Spark performance is notoriously difficult to reason about. Spark’s parallelized architecture makes it difficult to identify bottlenecks when jobs are running, and as a result, users often struggle to determine how to optimize their jobs for the best performance. This talk will take a deep dive into techniques for identifying resource bottlenecks in Spark. I’ll begin with the past, and discuss instrumentation that was added to Spark to measure how long jobs spend waiting on disk and network I/O. Next, I’ll discuss future-looking work from the research community that explores an alternative architecture for Spark based on single-resource monotasks. Using monotasks makes it trivial for users to understand bottlenecks and predict their workloads’ performance under different hardware and software configurations. This future-looking approach requires a radical re-architecting of Spark’s internals, so I’ll end with the present, and describe how lessons from that work could be applied to Spark today to give users much more information about the performance of their workloads.
    Bio: Kay Ousterhout is an Apache Spark PMC member and a recent UC Berkeley Ph.D. graduate. Kay’s Ph.D. research focused on understanding and improving the performance of large-scale data analytics frameworks. In the Spark project, Kay has focused on improving scheduler performance.

    Tech-Talk 3: Deep Learning Pipelines: Enabling AI in Production
    Abstract: Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is an Apache Spark Packages library that makes practical deep learning simple, based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk, we discuss the philosophy behind Deep Learning Pipelines, as well as the main tools it provides, how they fit into the deep learning ecosystem, and how they demonstrate Spark's role in deep learning.
    Bio: Sue Ann Hong is a software engineer on the Machine Learning team at Databricks, where she contributes to MLlib and the Deep Learning Pipelines library. She got her Ph.D. at CMU studying machine learning and distributed optimization, and worked as a software engineer at Facebook in Ads and Commerce. See you there!
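    The Pipelines idea that Tech-Talk 3 builds on can be sketched schematically. The class below is a hypothetical, stdlib-only stand-in, not MLlib's API: a pipeline is an ordered list of stages, each transforming the output of the previous one (in Spark MLlib this is `pyspark.ml.Pipeline` with Transformer/Estimator stages; Deep Learning Pipelines adds deep learning stages on top).

```python
# A pipeline as a composition of dataset-in, dataset-out stages.

class Pipeline:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, data):
        for stage in self.stages:
            data = stage(data)   # each stage: dataset in, dataset out
        return data

tokenize = lambda docs: [d.split() for d in docs]
count_tokens = lambda rows: [len(r) for r in rows]

pipe = Pipeline([tokenize, count_tokens])
print(pipe.transform(["deep learning", "on spark"]))  # → [2, 2]
```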

  • Pre-Spark Summit Meetup in Dublin, Ireland

    The Convention Center Dublin

    Apache Spark Meetup in Dublin, Ireland, for Spark Summit EU (https://spark-summit.org/eu-2017/) 2017. Are you planning to attend Spark Summit? Do you live in Dublin? Are you planning to be in Dublin the week of Spark Summit? Then join the EU Apache Spark community and RSVP now at this URL: https://www.eventbrite.com/e/pre-spark-summit-meetup-in-dublin-ireland-tickets-37826429870

  • Apache Spark Saturday #2 in DC

    Capital One

    Capital One Conference Center, McLean, VA

    *** REGISTRATION FOR THIS EVENT MUST BE COMPLETED HERE AND NOT VIA MEETUP: https://goo.gl/9r663d ***

    Join us for the second annual Spark Saturday event hosted by Capital One (https://www.capitalone.com/), MetiStream (http://www.metistream.com/), and Databricks (https://databricks.com) in partnership with the Washington DC Apache Spark Interactive and Bay Area Spark Meetups! This is a FREE event hosted by a community of Spark enthusiasts to foster Big Data innovation in the DC area and support the fastest-growing Big Data project - Apache Spark. Hear from industry experts who are actively contributing to and shaping the development of Apache Spark and successfully implementing Spark in production. Read more here (https://sparksaturday.wordpress.com/spark-saturday-dc-2/about/) (https://goo.gl/P3UBtV).

    Our agenda includes:
    - Learn how Capital One is leveraging Apache Spark to prevent credit card fraud.
    - Learn how to use Apache Spark properly in your Big Data architecture.
    - Learn how to serialize and deploy Apache Spark machine learning algorithms.

    TRAINING: Take a free Apache Spark training/workshop class taught by Databricks, founded by the creators of Apache Spark, and MetiStream, certified Spark system integrators and Spark trainers. Two separate 2.5-hour classes will be available for beginner and advanced students. The beginner class will cover concepts of Apache Spark fundamentals, DataFrames, Datasets, SparkSession, Spark SQL, and Structured Streaming, all with hands-on workshops conducted on Databricks Community Edition (https://databricks.com/try). The advanced class will use Hail, an open-source genomic processing tool built on Spark, to take participants through a hands-on healthcare use case and demonstrate Spark’s machine learning capabilities on Databricks. We will leverage the Databricks Community Edition - designed for developers, data scientists, data engineers, and anyone who wants to learn Spark. In our training class, we will provide an Apache Spark quick start and review various demos to help jump-start your Spark education. Training tickets are subject to availability. You will be notified a week before the event if a seat is available.

    AGENDA:
    - 8:00 am: Registration and Breakfast
    - 9:00 am: Kick-off
    - 11:00 am to 1:30 pm: Two Training Sessions
    - 10:00 am to 4:00 pm: Multiple Tech-talk sessions
    The event will include free breakfast and lunch. We will follow the event with a happy hour and drinks (cash bar). More details to follow.

    WHO SHOULD ATTEND: Anyone interested in learning more about Spark, including advanced and new learners. Business professionals, technologists, and students are welcome.

    PARKING: Free parking is plentiful and available on-site.