- Koalas: pandas APIs on Apache Spark/Magenta: Exploring the role of ML
This is a cross-posting event organized by our friends at SF PyData Meetup.

PLEASE RSVP HERE: https://www.meetup.com/San-Francisco-PyData/events/262203749/

## Agenda:

6:00 - 6:45pm: Mingling
6:45 - 6:50pm: Opening remarks
6:50 - 7:35pm: Tech-Talk-1: Koalas: pandas APIs on Apache Spark
7:35 - 8:05pm: Tech-Talk-2: Magenta: Exploring the role of machine learning in creativity
8:05 - 8:30pm: Mingling
8:30pm: Event over!

## Koalas: pandas APIs on Apache Spark

#### Abstract
In this talk, Reynold will present Koalas, a new open source project that was announced at the Spark + AI Summit in April. Koalas is a Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework. Reynold will demonstrate the functionality added to Koalas since its initial release, discuss its roadmap, and share how he envisions Koalas becoming the standard API for large-scale data science.

#### Speaker Bio
Reynold Xin is a cofounder and Chief Architect at Databricks. In the open source community, Reynold is known as a top contributor to the Apache Spark project, having designed many of its core user-facing APIs and execution engine features. Reynold received a PhD in Computer Science from UC Berkeley, where he worked on large-scale data processing systems.

## Magenta: Exploring the role of machine learning in creativity

#### Abstract:
Hanoi will give an overview of Magenta, an open source project at Google led by Doug Eck that explores the role of machine learning in the creative process. The talk will cover exciting and recent developments in the world of AI-generated art as well as live musical demos!

#### Speaker Bio:
Lamtharn (Hanoi) Hantrakul - AI Resident, Google Brain

Hanoi - like the towers? No. Like the city? Yes! Hanoi's Thai parents fell in love there and nicknamed him after the charming city. Born and raised in Bangkok, Thailand, he is currently an AI Resident with Google Brain. His research focuses on real-time Neural Audio Synthesis: rendering sound directly using deep neural networks. Hanoi strives for technologies that are transcultural at heart; diversifying software, hardware and AI to encompass underrepresented cultures such as those from his home region of Thailand and beyond. He is most proud of fidular, a modular fiddle system he designed and engineered that enables components like resonators and strings to be swapped across cultures. The system is currently on display at the Musical Instruments Museum in Phoenix, AZ, and has been recognized internationally by the A’ Design Award and Core77 Design Award. Hanoi holds degrees in Applied Physics and Music Composition, both with Distinction, from Yale University. In his MSc thesis at Georgia Tech, he developed machine learning models for an ultrasound sensor that enables amputees to perform high-dexterity tasks like playing piano; an impossible feat using today’s sensors on conventional prosthetics. He also writes music under the moniker "yaboi hanoi". Find his tunes on Instagram and Spotify!
- Bay Area Apache Spark Meetup @ Salesforce SF
Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark, Machine Learning, and Delta Lake at scale from Salesforce and Databricks. This meetup is hosted and sponsored by Salesforce.

Agenda:
6:00 - 6:30 pm: Social Hour with Food & Drinks
6:30 - 6:35 pm: Introduction & Announcements
6:35 - 7:15 pm: Tech Talk from Salesforce Einstein
7:15 - 7:55 pm: Tech Talk from Databricks: Delta Lake
7:55 - 8:15 pm: Additional Networking, Q&A

Talk 1
Title: Apache Spark at Salesforce Einstein for Marketing Cloud
Presenters: Peter Krmpotic and Kexin Xie

Abstract: Salesforce Einstein makes more than 6.5B predictions per day to deliver highly personalized customer experiences on behalf of some of the biggest brands in the world, and its engineering and product teams have extensive experience using Apache Spark in production. This talk will cover some of their biggest use cases, their successful transition from Apache Hadoop to Apache Spark, valuable insights about using Spark in production, and an architecture review of one of their core capabilities.

Bio: Peter Krmpotic is a director of product management for Salesforce Marketing Cloud Einstein and former product leader at Adobe Experience Cloud, BrightEdge and Boost Media. This career experience has given him a comprehensive view of what brands must develop to provide unified customer experiences at scale, the bedrock of any digitally transformed organization. He works with customers to instill customer-centricity to set them up as AI-first companies and leaders of the technology-based emergence of the fourth industrial revolution. In his current role at Salesforce, Peter focuses on democratizing artificial intelligence, specifically deep learning and natural language processing, for the purpose of personalizing customer experiences at scale. He holds an MSc in Computer Science from Karlsruhe Institute of Technology (KIT) in Germany.

Bio: At Salesforce, Kexin Xie is responsible for researching and designing the core distributed data processing and machine learning architecture for Marketing Cloud Einstein and Salesforce DMP. He leads a team of data science engineers with a strong focus on continuously improving operational aspects such as performance, fault tolerance, scale, automation, and cost. Before Salesforce, Kexin worked for Krux, BigCommerce, NICTA, Brandscreen, Freelancer and Microsoft Research building large-scale machine learning, data mining, real-time bidding, intelligent marketing, anti-fraud and anti-money laundering software systems. Kexin also holds a Ph.D. degree in computer science.

Talk 2
Title: Open Source Reliability for Data Lake with Apache Spark
Presenter: Michael Armbrust

Abstract: Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs. In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta
* How the Delta transaction protocol works internally
* The Delta roadmap for the next few releases
* How to get involved!

Bio: Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and the Delta Lake open source project. He received his Ph.D. from UC Berkeley in 2013 and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.
- Spark + AI Summit Meetup @ Moscone Center West, San Francisco
Join us for an evening of Bay Area Apache Spark Meetup at the Spark + AI Summit (https://databricks.com/sparkaisummit/north-america) featuring tech-talks from Google Inc and Workday Inc on Machine Learning. Thanks to #WomenInUnifiedAnalytics & Diversity Team at Databricks for sponsoring this meetup. (Note: This meetup is open to everyone. You don’t have to be registered for Spark + AI Summit.)

Agenda:
6:00 - 6:30 pm Happy Hour: Mingling & Refreshments
6:30 - 6:40 pm Opening Remarks (Jules Damji, Databricks)
6:40 - 7:25 pm Talk-1: Paige Bailey, Google
7:25 - 8:05 pm Talk-2: Madhura Dudhgaonkar, Workday
8:05 - 9:00 pm More Mingling & Networking

Talk 1: Details coming soon...

Abstract:

Bio: Paige Bailey (https://www.linkedin.com/in/dynamicwebpaige) is a Senior TensorFlow Developer Advocate at Google, based in Mountain View, CA. Prior to joining Google, Paige worked as a senior software engineer in the office of the Azure CTO; as a Cloud Developer Advocate for machine learning at Microsoft; and as a data scientist for Chevron in Houston, TX. Paige has over a decade of experience using Python for data analysis and five years of experience doing machine learning, and can't wait to show you the new capabilities in TensorFlow 2.0.

Talk 2: Machine Learning Products - How do you begin and when do you scale?

Abstract: So you have heard all the hype around how Machine Learning is going to change the world. But within your business context, where do you start? How do you choose the right use cases to begin with? How do you get leadership buy-in for more investment? And when do you start thinking about scaling your ML Services? In this session, you will walk away with an actionable framework to bootstrap and scale an applied machine learning services function. You will see the framework in action through an actual 0 to 1 product journey involving deep learning where we delivered value in record speed in spite of not having a dataset when we started. You will get practical tips on deciding when and how to scale your ML services and platform, and on how to get leadership buy-in for more investment. You will go back with some counter-intuitive tips that we discovered as a result of productizing ML services over the last 5+ years using a diverse range of technologies: Vision, Language, Graph, Anomaly Detection, Search Relevance, Personalization.

Bio: Madhura Dudhgaonkar (https://www.linkedin.com/in/madhurad) is a Machine Learning leader at Workday passionate about modernizing the future of work. Her team, a pioneer in the Enterprise Machine Learning space, has spent 6+ years building ML products leveraging Vision, Natural Language Processing, Recommendations, Anomaly Detection and more. Madhura’s career journey goes from being a hands-on engineer to leading large organizations across Sun Microsystems, Adobe and now Workday. Her background covers building both consumer and enterprise products - the latest of them involving multiple 0 to 1 product journeys leveraging Machine Learning. She is considered a thought leader in building ML products and is frequently invited to speak at AI conferences. Madhura holds a master’s degree in Math and Computer Science. She is also passionate about building diverse teams and leads diversity and inclusion work through a Women at Workday chapter.
- Bay Area Apache Spark Meetup @ Unravel Data in SF
Let's kick off the New Year 2019 with our first BASM Meetup! Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark at scale from Unravel Data and Databricks.

Agenda:
6:30 - 7:00 pm: Social Hour with Food, Drinks, Beer & Wine
7:00 - 7:05 pm: Jules Introduction & Announcements
7:05 - 7:50 pm: Tech Talk from Unravel Data
8:00 - 8:45 pm: Tech Talk from Databricks
8:45 - 9:00 pm: Unravel Raffle and Additional Networking, Q&A

Tech Talk 1: Putting AI to Work on Apache Spark
Presenter: Shivnath Babu

Abstract: Apache Spark simplifies AI, but why not use AI to simplify Spark performance and operations management? An AI-driven approach can drastically reduce the time Spark application developers and operations teams spend troubleshooting problems. This talk will discuss algorithms that run real-time streaming pipelines as well as build ML models in batch to enable Spark users to automatically solve problems like: (i) fixing a failed Spark application, (ii) auto tuning SLA-bound Spark streaming pipelines, (iii) identifying the best broadcast joins and caching for SparkSQL queries and tables, (iv) picking cost-effective machine types and container sizes to run Spark workloads on AWS, Azure, and Google Cloud; and more.

Bio: Shivnath Babu is CTO and Co-Founder at Unravel Data Systems and an adjunct professor of computer science at Duke University. Shivnath co-founded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

Tech Talk 2: Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark
Presenter: Lu Wang

Abstract: Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data. Project Hydrogen is a major Apache Spark initiative to bring the best AI and big data solutions together. It introduced barrier execution mode in the Spark 2.4.0 release to help distributed model training, and it explores optimized data exchange to accelerate distributed model inference. In this talk, we will explain why barrier execution mode is needed, how it works, and how to use it to integrate distributed DL training on Spark. We will demonstrate HorovodRunner, the first Spark+AI integration powered by Project Hydrogen. It is built on the Horovod framework developed by Uber and on Databricks Runtime 5.0 for Machine Learning. We will also share our experience and performance tips on how to combine Pandas UDF from Spark and AI frameworks to scale complex model inference workloads.

Bio: Lu Wang is a software engineer at Databricks. His main research interests are developing high-performance parallel algorithms for scientific computing and applications. He has been actively involved in the development of Project Hydrogen, Spark Deep Learning pipelines, and Spark MLlib since he joined Databricks. Before Databricks, he worked on parallel multigrid linear solvers on exascale parallel machines for solving the linear systems from reservoir simulations at Lawrence Livermore National Laboratory. He received his Ph.D. from the Pennsylvania State University in 2014.
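
The Tech Talk 2 abstract says barrier execution mode was introduced to help distributed model training, where the tasks of a stage need to run together and synchronize with one another. As a rough illustration of that synchronization idea only, here is a minimal sketch in plain Python using `threading.Barrier` from the standard library. It is an analogy, not Spark's API: the function `run_barrier_stage`, the "setup" and "train" phase names, and the use of threads in place of Spark tasks are all assumptions made for this example.

```python
import threading


def run_barrier_stage(num_tasks):
    """Run num_tasks workers that must all finish setup before any starts training."""
    barrier = threading.Barrier(num_tasks)  # hypothetical stand-in for a stage-wide barrier
    events = []                             # (phase, task_id) pairs in the order they happened
    lock = threading.Lock()

    def task(task_id):
        with lock:
            events.append(("setup", task_id))
        barrier.wait()                      # blocks until every task has reached this point
        with lock:
            events.append(("train", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events


events = run_barrier_stage(4)
phases = [phase for phase, _ in events]
print(phases)
```

Every task records its setup step before any task records a training step, which is the guarantee a barrier provides. Scheduling and failure handling inside Spark are not modeled here.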
- Bay Area Apache Spark Meetup @ Adobe in San Jose, CA
Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark from Adobe (https://www.adobe.com/) and an Apache Spark committer from Databricks (https://databricks.com).

Agenda:
6:00 - 6:30 pm Mingling & Refreshments
6:30 - 6:40 pm Welcome opening remarks, announcements, acknowledgments, and introductions (Jules Damji + Adobe)
6:40 - 7:20 pm Apache Spark at Adobe
7:20 - 8:00 pm Upcoming Apache Spark 2.4: What’s New & Why Should You Care
8:00 - 8:30 pm More Mingling & Networking

Tech-Talk 1: Apache Spark at Adobe

Abstract: The Adobe Cloud Platform is a multi-tenant, big data stack as a service on the cloud which provides the modern foundation for all the various parts of the Adobe Experience Cloud. From a data processing perspective, Adobe uses Apache Spark in a variety of scenarios. We will talk about the high-level data architecture, briefly touching on the infrastructure, scale, and challenges, and lastly, we will cover how we are leveraging Spark. As part of the Cloud Platform, we have also built a Query Engine leveraging Spark SQL for ad-hoc data querying. The Query Engine has implemented a PostgreSQL protocol and leverages Akka Streams and the Presto Parser as an abstraction layer around Spark SQL. We will talk about the high-level architecture and the various patches made to Spark SQL, such as support for nested column pruning, that are critical to our performance needs when accessing data with thousands of nested columns.

Bio: Yogesh Natarajan is a senior software engineer in the Data Platform group at Adobe. His interests include building server-side web applications and big data technologies. He previously worked at Chegg and Yahoo and holds a master's degree from UC Irvine.

Bio: Andrew is a senior software engineer in the Data Platform group at Adobe. He specializes in building modern, scalable, cloud-based Scala applications.

Tech-Talk 2: Upcoming Apache Spark 2.4: What’s New & Why Should You Care

Abstract: The upcoming Apache Spark 2.4 release is the fifth release in the 2.x series. This talk will provide an overview of the major features and enhancements in this upcoming release.
* A new scheduling model (Barrier Scheduling) to enable users to properly embed distributed Deep Learning training as a Spark stage to simplify the distributed training workflow.
* 35 higher-order functions are added for manipulating arrays/maps in Spark SQL.
* A new native AVRO data source, based on Databricks' spark-avro module.
* PySpark also introduces eager evaluation mode on all operations for teaching and debuggability.
* Spark on K8S adds PySpark and R support and client-mode support.
* Various enhancements in structured streaming, e.g., stateful operators in continuous processing.
* Various performance improvements in built-in data sources, e.g., Parquet nested schema pruning.

Bio: Xiao Li is a software engineer at Databricks. His main interests are in Spark SQL, data replication, and data integration. Previously, he was an IBM master inventor and an expert on asynchronous database replication. He received his Ph.D. from the University of Florida in 2011. He is a Spark committer and PMC member.

PARKING: All visitors attending the Adobe/Spark Meetup on ET 01 Park Conference room will need to park in the East Tower basement level 1.
- Bay Area Apache Spark Meetup @ Databricks, HQ in San Francisco
Join us for an evening of Bay Area Apache Spark Meetup featuring open-source tech-talks about using and innovating with Apache Spark from Databricks (https://databricks.com). Thanks to Databricks for hosting and sponsoring this meetup.

Agenda:
6:00 - 6:30 pm Mingling & Refreshments
6:30 - 6:40 pm Welcome opening remarks, announcements & introductions (Jules S. Damji + Reynold Xin)
6:40 - 7:25 pm Tech-Talk-1 Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark
7:25 - 8:10 pm Tech-Talk-2 MLflow: Infrastructure for a Complete Machine Learning Life Cycle
8:10 - 8:30 pm Mingling & Networking

Tech-Talk 1: Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark

Abstract: Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data and Apache Spark. Increasingly, Spark users want to integrate Spark with distributed deep learning and machine learning frameworks built for state-of-the-art training. On the other side, DL/AI users increasingly want to handle the large and complex data scenarios needed for their production pipelines. This talk introduces a new project that substantially improves the performance and fault-recovery of distributed deep learning and machine learning frameworks on Spark. We will introduce the major directions and provide progress updates, including 1) barrier execution mode for distributed DL training, 2) fast data exchange between Spark and DL frameworks, and 3) accelerator-aware scheduling.

Bio: Xiangrui Meng is an Apache Spark PMC member and a software engineer at Databricks. His main interests center around developing and implementing scalable algorithms for scientific applications. He has been actively involved in the development and maintenance of Spark MLlib since he joined Databricks. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His Ph.D. work at Stanford is on randomized algorithms for large-scale linear regression problems.

Tech-Talk 2: MLflow: Infrastructure for a Complete Machine Learning Life Cycle

Abstract: ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom “ML platforms” that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company’s internal infrastructure. In this talk, we will present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.

Bio: Members from the MLflow Team (https://www.mlflow.org/)
- Spark + AI Summit: Bay Area Apache Spark Meetup @ Moscone Center, SF
Moscone Center, San Francisco, CA
Moscone Center Room 2014

Join us for an evening of Bay Area Apache Spark Meetup at the Spark + AI Summit (https://databricks.com/sparkaisummit/north-america) featuring tech-talks from Databricks (https://databricks.com/), Uber (https://www.uber.com/), and Stanford University (https://www.stanford.edu/). Thanks to Databricks for hosting and sponsoring this meetup. (Note: This meetup is open to everyone. You don’t have to be registered for Spark + AI Summit.)

Agenda:
6:00 - 6:30 pm Mingling & Refreshments
6:30 - 6:40 pm Opening Remarks & Introductions, Jules Damji, Databricks
6:40 - 7:20 pm Tech Talk-1: Richard Garris, Databricks
7:20 - 8:00 pm Tech Talk-2: Alexander Sergeev, Uber
8:00 - 8:05 pm Short Break
8:05 - 8:45 pm Tech Talk-3: Peter Kraft, Stanford University
8:45 - 9:00 pm More Mingling & Networking

Tech-Talk 1: Understanding Parallelization of Machine Learning Algorithms in Apache Spark™

Abstract: Machine Learning (ML) is a subset of Artificial Intelligence (AI). In this talk, Richard Garris, Principal Architect at Databricks, will explain how various ML algorithms are parallelized in Apache Spark. Andrew Ng calls the algorithms the "rocket ship" and the data "the fuel that you feed machine learning" to build deep learning applications. We will start with an understanding of machine learning pipelines built using single machine algorithms including Pandas, scikit-learn, and R. Then we will discuss how Apache Spark MLlib can be used to parallelize your machine learning pipeline with Linear Regression and Random Forest. Lastly, we will discuss ways to parallelize single machine algorithms in Spark by broadcasting the data and then performing distributed feature selection, model creation or hyperparameter tuning.

Bio: Richard Garris is a Principal Solutions Architect at Databricks focused on helping clients with their Advanced Analytics initiatives using Apache Spark and MLlib. He has spent 13 years working with enterprises in data management and analytics. Richard got his undergraduate degree at The Ohio State University and a Master's in Software Management from CMU. His previous work experience includes Skytree, Google, and PwC.

Tech-Talk 2: Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow

Abstract: Horovod makes it easy to train a single-GPU TensorFlow model on many GPUs - both on a single server and across multiple servers. The talk will touch upon mechanisms of deep learning training, challenges that distributed deep learning poses, mechanics of Horovod, as well as practical steps necessary to train a deep learning model on your favorite cluster.

Bio: Alex Sergeev is a Deep Learning Infrastructure Engineer at Uber working on scalable Deep Learning. He received his M.S. degree in Computer Science from Moscow Engineering Physics Institute. Before joining Uber, he was a Senior Software Engineer at Microsoft working on Big Data Mining.

Tech-Talk 3: Apache Spark™ and MacroBase

Abstract: In this talk, we present MacroBase, an analytics system we have built at Stanford University that uses Apache Spark to prioritize human attention via large-scale feature selection. In a world swamped with enormous datasets and an enormous variety of complex tools to analyze them, MacroBase specializes in one task: finding and explaining unusual or interesting trends in data as easily as possible. Specifically, it searches for correlations in large-scale datasets. For example, an app developer wondering why their app was crashing could ask MacroBase to find factors in their logs that correlate with crash behavior and explain the crashes. Alternatively, an analyst looking for trends in time series data could ask MacroBase to find changes over time. MacroBase relies on Spark and Spark-SQL to provide fast and easy-to-use analytics. Users operate MacroBase using MacroBase-SQL, an extension of SQL that introduces new operators to partition datasets and find explanations on partitions. MacroBase-SQL is built on top of Spark-SQL, with its new operators taking in and returning Spark dataframes. This means that MacroBase is fully distributed and can easily be integrated into any system already running Spark or Spark-SQL. In this talk, we will explain how we built MacroBase's new operators on top of Spark and what you can do with them.

Bio: Peter Kraft is a first-year graduate student at Stanford advised by Peter Bailis and Matei Zaharia. He is interested in solving problems at the intersection of systems and machine learning and in building more usable and powerful machine learning systems.
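
The MacroBase abstract gives the example of finding factors in logs that correlate with crashes. The plain-Python sketch below shows one simple way such a search could score candidate factors, using a risk ratio: the crash rate among records with a given attribute value divided by the crash rate among all other records. This is an illustrative assumption only. The function `risk_ratios`, the sample records, and the choice of risk ratio as the score are invented for this example and do not represent MacroBase-SQL's actual operators or implementation.

```python
from collections import Counter


def risk_ratios(records, outcome_key):
    """Score each (attribute, value) pair by outcome rate with it vs. without it."""
    total = len(records)
    total_bad = sum(1 for r in records if r[outcome_key])
    seen = Counter()      # how many records carry each (attribute, value)
    seen_bad = Counter()  # how many of those records have the outcome

    for r in records:
        for attr, value in r.items():
            if attr == outcome_key:
                continue
            seen[(attr, value)] += 1
            if r[outcome_key]:
                seen_bad[(attr, value)] += 1

    scores = {}
    for key, n in seen.items():
        rest = total - n
        if rest == 0:
            continue  # every record has this value, so there is nothing to compare against
        rate_with = seen_bad[key] / n
        rate_without = (total_bad - seen_bad[key]) / rest
        if rate_without == 0:
            if rate_with > 0:
                scores[key] = float("inf")  # outcome only ever occurs with this value
            continue
        scores[key] = rate_with / rate_without
    return scores


# Hypothetical crash log: each record is one app session.
logs = [
    {"os": "11.0", "device": "A", "crashed": True},
    {"os": "11.0", "device": "B", "crashed": True},
    {"os": "11.0", "device": "B", "crashed": True},
    {"os": "11.0", "device": "B", "crashed": False},
    {"os": "10.3", "device": "A", "crashed": False},
    {"os": "10.3", "device": "B", "crashed": False},
    {"os": "10.3", "device": "A", "crashed": True},
    {"os": "10.3", "device": "B", "crashed": False},
]

scores = risk_ratios(logs, "crashed")
top_factor = max(scores, key=scores.get)
print(top_factor, scores[top_factor])
```

In this toy data, sessions on OS version 11.0 crash three times as often as the rest, so that attribute value ranks first as a candidate explanation.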
- BASM @ Bloomberg in San Francisco
Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks using Apache Spark from Bloomberg (https://www.bloomberg.com/) and Databricks (https://databricks.com/). Thanks to Bloomberg (https://www.bloomberg.com/) for hosting and sponsoring this meetup.

Bloomberg building security requires that you fill out this form if you RSVP: https://goo.gl/forms/wjeDeg6HPLAeeXIC3

Agenda:
6:00 - 6:30 pm Mingling & Refreshments
6:30 - 6:35 pm Welcome opening remarks, announcements, acknowledgments, and introductions
6:35 - 7:15 pm Bloomberg Ilan Filonenko: Apache Spark on K8s and HDFS Security
7:20 - 8:00 pm Databricks Jules S. Damji: What’s New in Apache Spark 2.3 and Why Should You Care
8:00 - 8:30 pm Mingling

Tech-Talk 1: Apache Spark on K8s and HDFS Security

Abstract: There is growing interest in running Apache Spark natively on Kubernetes. Ilan Filonenko will explain the design idioms, architecture and internal mechanics of Spark orchestrations over Kubernetes. Since data for Spark analytics is often stored in HDFS, Ilan will also explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as data locality and security through the use of Kubernetes constructs such as secrets and RBAC rules.

Bio: Ilan Filonenko, Software Engineer, Bloomberg
Ilan Filonenko is a member of the Data Science Infrastructure team at Bloomberg, where he has designed and implemented distributed systems at both the application and infrastructure level. Previously, Ilan was an engineering consultant and technical lead in various startups and research divisions across multiple industry verticals, including medicine, hospitality, finance, and music. Ilan’s research has focused on algorithmic, software, and hardware techniques for high-performance machine learning, with a focus on optimizing stochastic algorithms such as stochastic gradient descent (SGD).

Tech-Talk 2: What’s New in Apache Spark 2.3 and Why Should You Care

Abstract: The Apache Spark 2.3 release marks a big step forward in speed, unification, and API support. This talk will quickly walk through what’s new and how you can benefit from the upcoming improvements:
* Continuous Processing in Structured Streaming.
* PySpark support for vectorization, giving Python developers the ability to run native Python code fast.
* Native Kubernetes support, marrying the best of container orchestration and distributed data processing.

Bio: Jules S. Damji is an Apache Spark Community and Developer Advocate at Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, @Home, LoudCloud/Opsware, VeriSign, ProQuest, and Hortonworks, building large-scale distributed systems. He holds a B.Sc and M.Sc in Computer Science and an MA in Political Advocacy and Communication from Oregon State University, Cal State, and Johns Hopkins University respectively.
- Bay Area Apache Spark & Women in Big Data @ Databricks HQ, SF
Hosted and moderated by Maddie Schults (https://www.linkedin.com/in/maddieschults/) from Databricks (https://databricks.com/), please join us for an evening of Bay Area Apache Spark and WiBD (https://www.womeninbigdata.org/blog/) Meetup featuring tech-talks from women in engineering. Thanks to Databricks (https://databricks.com) for hosting and sponsoring this meetup.

Agenda:
6:00 - 6:30 pm Mingling & Refreshments
6:30 - 6:40 pm Welcome opening remarks, announcements, acknowledgments, and introductions
6:40 - 7:15 pm Holden Karau: Bringing a Jewel (as a starter) from the Python world to the JVM with Apache Spark, Arrow, and Spacy
7:15 - 7:50 pm Anya Bida: Just enough DevOps for Data Scientists (Part II)
7:50 - 8:25 pm Shan He: Creating Beautiful and Meaningful Visualizations with Big Data
8:25 - 8:45 pm More Mingling & Networking

Tech-Talk 1: Bringing a Jewel (as a starter) from the Python world to the JVM with Apache Spark, Arrow, and Spacy

Abstract: With the new Apache Arrow integration in PySpark 2.3, it is now starting to become reasonable to look to the Python world and ask “what else do we want to steal besides tensorflow”, or as a Python developer look and say “how can I get my code into production without it being rewritten into a mess of Java?” Regardless of your specific side(s) in the JVM/Python divide, collaboration is getting a lot faster, so let's learn how to share! In this brief talk we will examine sharing some of the wonders of Spacy with the Java world, which still has a somewhat lackluster set of options for NLP.

Bio: Holden Karau (https://www.linkedin.com/in/holdenkarau/)

Tech-Talk 2: Just enough DevOps for Data Scientists (Part II)

Abstract: Imagine we have Ada, our data science intern. Let's run through a very simple wordcount Spark job, and find a handful of potential failure points. Dozens of failures can and should happen when running Spark jobs on commodity hardware. Given the basic foundation for infrastructure-level expectations, this talk gives Ada tools to ensure her job isn’t caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to focus on some more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infrastructure failures gracefully. Note: this talk is a Spark-focused extension of Part I, "Just Enough DevOps For Data Scientists" from Scale by The Bay 2018 https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s

Bio: Anya Bida (https://www.linkedin.com/in/anyabida/)

Tech-Talk 3: Creating Beautiful and Meaningful Visualizations with Big Data

Abstract: At Uber, location data is our biggest asset. How do we create data visualizations with rich location data, render a million points of events in the blink of an eye, and, most importantly, derive insights from them? In this presentation, you'll get a behind-the-scenes look at the tools and data visualizations we use at Uber to inform business decisions. I will walk us through an overview of the data visualization process with a case study, and discuss how and why we built our own visualization tool to visualize location data in a more meaningful way. I will also show that you can create beautiful visualizations, but in order for them to be useful, you have to understand the information you are designing.

Bio: Shan He (https://www.linkedin.com/in/shan-he-25400b16/)
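
The "Just enough DevOps" abstract walks through a simple word count job and the failures it can run into. As a plain-Python sketch of that idea (not Spark code), the example below counts words per partition and retries a partition when a simulated infrastructure failure occurs. The helper names `count_words`, `make_flaky`, and `run_with_retries`, the retry limit of 4, and the use of `OSError` as the failure signal are all assumptions made for illustration.

```python
from collections import Counter


def count_words(lines):
    """Count lowercase, whitespace-separated words in one partition of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def make_flaky(task, failures_before_success):
    """Wrap a task so its first few calls raise, simulating lost workers."""
    remaining = {"n": failures_before_success}

    def wrapped(partition):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise OSError("simulated infrastructure failure")
        return task(partition)

    return wrapped


def run_with_retries(task, partition, max_attempts=4):
    """Re-run a failed task up to max_attempts times, then give up and re-raise."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(partition)
        except OSError:
            if attempt == max_attempts:
                raise


partitions = [["to be or not to be"], ["that is the question"]]
flaky_count = make_flaky(count_words, failures_before_success=2)

total = Counter()
for partition in partitions:
    total += run_with_retries(flaky_count, partition)

print(total)
```

The two simulated failures are absorbed by the retry loop, so the final counts match what a failure-free run would produce. A failure that persists past the retry limit is raised to the caller instead of being silently dropped.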
- Bay Area Apache Spark Meetup @ Workday in San Mateo
Happy New Year! Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks on Apache Spark (https://spark.apache.org/) at scale from Workday (https://www.workday.com/) and Databricks (https://databricks.com/).

Agenda:
6:30 - 7:00 pm Mingling & Refreshments
7:00 - 7:10 pm Welcome opening remarks, announcements, acknowledgments, and introductions
7:10 - 7:50 pm Workday Prism Analytics: Unifying Interactive and Batch Data Processing Using Apache Spark
7:50 - 8:30 pm Upcoming Release of Apache Spark 2.3: What’s New?
8:30 - 8:45 pm More Mingling & Networking

Tech-Talk 1: Workday Prism Analytics: Unifying Interactive and Batch Data Processing Using Apache Spark

Abstract: Workday Prism Analytics enables data discovery and interactive Business Intelligence analysis for Workday customers. To prepare data for analysis, business users can clean up and transform their datasets in an interactive, modern data prep environment. Thus, Workday Prism Analytics needs to run three types of scalable data processing applications: “always on” query engine and data prep applications, and on-demand batch execution of transformation pipelines. We standardized on Apache Spark and Spark SQL for all three applications because of their scalability and the flexibility and extensibility of the Catalyst compiler. All applications share much of the compilation and execution code, except for sampling, caching, and result extraction.

In this talk, we will first introduce Workday Prism Analytics and describe its Spark-based interactive and batch data processing components. We will then describe the data prep transformations and their compilation into Spark DataFrames, through Spark SQL Catalyst plans, in both interactive and batch mode. We will focus on some challenges we encountered while compiling and executing complex pipelines and queries. For example, Spark SQL compilation times exceeded execution time for some low-latency queries, and compiled plans grew dangerously large for data prep pipelines with multiple self-joins and self-unions. We will describe caching, sampling, and query compilation techniques that allow us to support an interactive user experience. Finally, we will conclude with an overview of the open challenges that we plan to tackle in the future.

Bio: Dr. Andrey Balmin is a Sr. Principal Engineer at Workday, where he is building the self-service Prism Analytics platform. His work on the foundational technology for Prism began at Platfora (which was acquired by Workday). Prior to this, he was a Research Staff Member at IBM Almaden Research Center, where he focused on search and query processing of semi-structured and graph-structured data in Data Warehousing and, later, Big Data platforms. He holds a Ph.D. degree in computer science from UC San Diego.

Tech-Talk 2: Upcoming Release of Apache Spark 2.3: What’s New?

Abstract: Apache Spark 2.0 set the architectural foundations of structure in Spark: unified high-level APIs, Structured Streaming, and underlying high-performance components such as the Catalyst optimizer and the Tungsten engine. Since then, the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2. Continuing in that spirit, the upcoming release of Apache Spark 2.3 makes similar strides, introducing new features and resolving over 1,300 JIRA issues. In this talk, we want to share with the community some salient features of the soon-to-be-released Spark 2.3:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs

Bio: Xiao Li is a software engineer at Databricks. His main interests are in Spark SQL, data replication, and data integration. Previously, he was an IBM master inventor and an expert on asynchronous database replication. He received his Ph.D. from the University of Florida in 2011. He is a Spark committer and PMC member.

Parking Instructions: The parking lot is under construction, but visitors may enter via the Madison Ave.