• Data Engineering: Dagster and Spark SQL

    Online event

    Please note that this meetup is a cross-promotion: https://www.meetup.com/data-ai-online/events/278189545/

    Join us for 3 tech talks on Data Engineering about Spark SQL and Dagster!

    *REGISTER NOW* (for FREE): https://databricks.com/dataaisummit/north-america-2021

    Talk 1: Faster Spark SQL: Adaptive Query Execution in Databricks by Allison & Maryann
    Abstract: Over the years, there has been extensive and continuous effort on improving Spark SQL's query optimizer and planner, in order to generate high-quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark make better decisions in picking the best query plan.

    Adaptive Query Execution now tackles the cases where those statistics are missing, outdated, or imprecise by re-optimizing and adjusting query plans based on runtime statistics collected in the process of query execution. This talk will introduce the adaptive query execution framework along with a few optimizations it employs to address some major performance challenges the industry faces when using Spark SQL. We will illustrate how these statistics-guided optimizations work to accelerate execution through query examples. Finally, we will share the significant performance improvement we have seen on the TPC-DS benchmark with Adaptive Query Execution.
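
    To make the idea of re-planning from runtime statistics concrete, here is a toy, pure-Python sketch of one such optimization: merging small shuffle partitions once their actual sizes are known. It is an illustration only and not Spark's implementation; the function name, the sizes, and the 100 MB target are assumptions chosen for the example. In Spark 3.x the real behavior is controlled by settings such as `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled`.

```python
def coalesce_partitions(sizes, target):
    """Greedily merge adjacent partitions so each group stays at or under
    `target` (a single oversized partition is kept on its own)."""
    groups, current, current_size = [], [], 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical post-shuffle partition sizes in MB, as measured at runtime.
observed_sizes = [10, 20, 70, 5, 5, 90]
plan = coalesce_partitions(observed_sizes, target=100)
print(plan)
```

    With these made-up sizes, the six original partitions collapse into two groups of roughly 100 MB each, so the next stage would need two tasks instead of six. A static planner could not make this choice because the sizes are only known after the shuffle has run.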

    Talk 2: Apache Spark SQL optimizations for machine learning across internet-sized data by Michael & Wenzhe
    Abstract: Quantcast regularly deals with internet-sized data (100s of billions of events per day) in order to train models that optimize advertising online. For the past 2 years, Quantcast has been investing in Spark as the backbone of our new and experimental data processing pipelines. From this work we have learned several Spark SQL optimizations that can make our problems orders of magnitude faster than the naive approach. We will describe how we use these optimizations in our pipelines with examples on sanitized data, including data transformations to minimize query costs, leveraging natural features in the data set to efficiently group and process it with pandas UDFs, and employing low-level optimizations using vectorization and JIT for faster Python execution.

    Talk 3: Introduction, principles and origin of Dagster by Nick
    Abstract: Nick will cover the principles and origin of Dagster. Dagster is a new type of workflow engine: a data orchestrator. Moving beyond just managing the ordering and physical execution of data computations, Dagster considers the entire data application lifecycle. Practitioners in Dagster build data-aware dependency graphs designed for local development and testing; deploy those graphs to a multi-tenant, cloud-native orchestration engine; and then monitor and observe the data assets produced by those computations.

    In this talk, Nick will cover how Dagster differentiates itself across the three stages (dev & test, deploy & execute, monitor & observe) of the application lifecycle. Through a demo and code snippets, the talk aims to show how the Dagit web UI and Dagster programming model can power a variety of data practitioners.
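
    As a rough picture of what a data-aware dependency graph means, the following pure-Python sketch wires steps together by the data they consume rather than by execution order alone. This is not Dagster's API; `run_graph` and the step names are hypothetical, and the sketch assumes the graph has no cycles.

```python
def run_graph(steps):
    """Run every step once, feeding each one the outputs of its upstream steps.

    `steps` maps a step name to (function, [names of upstream steps]).
    """
    results = {}

    def resolve(name):
        # Compute a step only the first time it is needed, then reuse its output.
        if name not in results:
            fn, upstream = steps[name]
            results[name] = fn(*[resolve(dep) for dep in upstream])
        return results[name]

    for name in steps:
        resolve(name)
    return results


# Hypothetical steps: each declares which outputs it depends on.
steps = {
    "raw_events": (lambda: [3, 1, 2], []),
    "sorted_events": (lambda events: sorted(events), ["raw_events"]),
    "event_count": (lambda events: len(events), ["raw_events"]),
    "report": (lambda ordered, count: {"first": ordered[0], "count": count},
               ["sorted_events", "event_count"]),
}
results = run_graph(steps)
print(results["report"])
```

    Because each step only sees named inputs, a source such as `raw_events` can be replaced with a small in-memory fixture for local testing while the downstream steps stay unchanged.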

    Speakers

    ** Nick Schrock is the founder and CEO of Elementl, the company behind Dagster. Previously, Nick worked at Facebook, where he co-created GraphQL.

    ** Michael Tong is a Machine Learning Engineer at Quantcast. His current projects at Quantcast focus on developing model training pipelines to process petabytes of data to train tens of thousands of models.

    ** Wenzhe (David) Xu is a Machine Learning Engineer at Quantcast. He has been applying various machine learning techniques to large-scale graphs by utilizing Spark SQL.

    ** Maryann Xue is a staff software engineer at Databricks and a committer and PMC member of Apache Calcite and Apache Phoenix.

    ** Allison Wang is a software engineer at Databricks, primarily focusing on Spark SQL.

  • Online Workshop: Introduction to Apache Spark

    Online event

    Our affiliated Data + AI Online meetup is hosting an upcoming workshop: Introduction to Apache Spark.

    This is a cross-post; please RSVP here: https://www.meetup.com/data-ai-online/events/270166620/

    Join us for Part 4 of our online learning series: Introduction to Data Analysis for Aspiring Data Scientists. This is the final online workshop in this series for anyone and everyone interested in learning about data analysis.

    Part 4: Introduction to Apache Spark

    Abstract: This workshop covers the fundamentals of Apache Spark, the most popular big data processing engine. In this workshop, you will learn how to ingest data with Spark, analyze the Spark UI, and gain a better understanding of distributed computing. We will be using data released by the NY Times (https://github.com/nytimes/covid-19-data). No prior knowledge of Spark is required, but Python experience is highly recommended.

    - Watch Part 1: Intro to Python - https://youtu.be/HBVQAlv8MRQ
    - Watch Part 2: Data Analysis with pandas - https://youtu.be/riSgfbq3jpY
    - Watch Part 3: Machine Learning - https://youtu.be/g103iO-izoI

    More details and to RSVP: https://www.meetup.com/data-ai-online/events/270166620/

  • Machine Learning Lessons Learned from the Field: Interview with Brooke Wenig

    Our affiliated Online Spark Meetup is hosting an online meetup to discuss Delta Lake and Apache Spark. Developer Advocate Denny Lee will interview Brooke Wenig, Machine Learning Practice Lead, on the best practices and patterns when developing, training, and deploying Machine Learning algorithms in production.

    This is a cross-post; please RSVP HERE: https://www.meetup.com/spark-online/events/268966416/

    Agenda: 10AM PST - 11AM PST (GMT-8)

    10AM - 10:40AM - Interview with Brooke
    10:40 - 11:00AM - Q&A

    Speakers:

    Brooke Wenig is the Machine Learning Practice Lead at Databricks. She guides and assists customers in implementing machine learning pipelines, as well as teaching Distributed Machine Learning & Deep Learning courses. She received an MS in Computer Science from UCLA with a focus on distributed machine learning. She speaks Mandarin Chinese fluently and enjoys cycling.

    Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data science engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master of Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.

  • The Genesis of Delta Lake - An Interview with Burak Yavuz

    Our affiliated Online Spark Meetup is hosting an online meetup to discuss Delta Lake and Apache Spark. Join us online to learn how Delta Lake and Apache Spark enable you to build reliable data pipelines.

    This is a cross-post, so please RSVP HERE: https://www.meetup.com/spark-online/events/268335912/

    New decade, new start! Let's kick off 2020 with our first online meetup of the year featuring Burak Yavuz, Software Engineer at Databricks, for a talk about the genesis of Delta Lake. Developer Advocate Denny Lee will interview Burak Yavuz to learn about the Delta Lake team's decision-making process and why they designed, architected, and implemented Delta Lake the way it is today. Understand the technical challenges that the team faced, how those challenges were solved, and learn about the plans for the future.

    Agenda: 10AM PST - 11AM PST (GMT-8)

    10AM - 10:40AM - Interview with Burak
    10:40 - 11:00AM - Q&A

    Speakers:

    Burak Yavuz is a Software Engineer at Databricks. He has been contributing to Spark since Spark 1.1 and is the maintainer of Spark Packages. Burak received his BS in Mechanical Engineering at Bogazici University, Istanbul, and his MS in Management Science & Engineering at Stanford.

    Denny Lee is a Developer Advocate at Databricks. He is a hands-on distributed systems and data science engineer with extensive experience developing internet-scale infrastructure, data platforms, and predictive analytics systems for both on-premises and cloud environments. He also has a Master of Biomedical Informatics from Oregon Health & Science University and has architected and implemented powerful data solutions for enterprise healthcare customers. His current technical focuses include Distributed Systems, Apache Spark, Deep Learning, Machine Learning, and Genomics.

  • Bay Area Apache Spark Meetup @ LinkedIn, Sunnyvale

    950 W Maude Ave

    Join us for an evening featuring tech-talks about Apache Spark and Delta Lake at scale from LinkedIn and Databricks.

    This meetup is hosted and sponsored by LinkedIn.

    Agenda:

    6:00 - 6:30 pm: Social Hour with Food & Drinks
    6:30 - 6:35 pm: Introduction & Announcements
    6:35 - 6:55 pm: Tech Talk-1 from LinkedIn
    6:55 - 7:15 pm: Tech Talk-2 from LinkedIn
    7:35 - 8:15 pm: Databricks Talk

    Talk 1 Title: Dali-Spark: Apache Spark Data Access at LinkedIn to Achieve Data Agility
    Presenter: Adwait Tumbde
    Abstract: At LinkedIn, changes to our complex data ecosystem can place steep costs on application developers. We have designed Dali to insulate developers from physical attributes of data like storage format and location. At its core, Dali provides a catalog to define and evolve physical and virtual datasets, a dataset reader that allows applications to read datasets in different environments, and a collection of development tools to manage these datasets. In this talk, we will cover:
    * What Dali is and how it simplifies a complex data ecosystem
    * Dali as a unified data access layer at LinkedIn
    * DaliSpark Architecture
    * Roadmap including plans to open source Dali
    Bio: Adwait Tumbde is an engineering manager at LinkedIn and leads a team focused on simplifying data management for big data. He has also contributed to the development of Apache Pinot and Presto at LinkedIn. Before joining LinkedIn, he was one of the original developers of Sherpa, a large scale key-value store at Yahoo!. His interests include large scale distributed systems and databases.

    Talk 2 Title: Optimizing Apache Spark SQL at LinkedIn
    Presenter: Fangshi Li
    Abstract: Improving Spark SQL usability and computing efficiency is one of the missions of LinkedIn’s Spark team. In this talk, we will present the Spark SQL ecosystem and roadmap at LinkedIn, and introduce the highlighted projects we are working on, such as:
    * Improving Dataset performance with automated column pruning
    * Bringing an efficient 2d join algorithm to Spark SQL
    * Fixing join skewness with adaptive execution
    * Enhancing the cost-optimizer with a history-based learning approach
    Bio: Fangshi Li is a software engineer at LinkedIn. He has been working on Spark core infrastructure, user libraries, AI solutions, and Spark SQL engine optimizations. He was one of the original developers of Dr. Elephant, the performance tuning tool for Hadoop/Spark.

    Talk 3 Title: Open Source Reliability for Data Lake with Apache Spark
    Presenter: Michael Armbrust (https://databricks.com/speaker/michael-armbrust)
    Abstract: Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

    In this talk, we will cover:
    * What data quality problems Delta helps address
    * How to convert your existing application to Delta Lake
    * How the Delta Lake transaction protocol works internally
    * The Delta Lake roadmap for the next few releases
    * How to get involved!
    Bio: Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and the Delta Lake open source project. He received his Ph.D. from UC Berkeley in 2013 and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.

    NOTE: You may need a government-issued ID to enter the premises or the conference room.

  • Koalas: pandas APIs on Apache Spark / Magenta: Exploring the role of ML

    Microsoft Reactor

    This is a cross-posting event organized by our friends at SF PyData Meetup.

    PLEASE RSVP HERE: https://www.meetup.com/San-Francisco-PyData/events/262203749/

    ## Agenda:

    6:00 - 6:45pm: Mingling
    6:45 - 6:50pm: Opening remarks
    6:50 - 7:35pm: Tech-Talk-1: Koalas: pandas APIs on Apache Spark
    7:35 - 8:05pm: Tech-Talk-2: Magenta: Exploring the role of machine learning in creativity
    8:05 - 8:30pm: Mingling
    8:30: Event over!

    ## Koalas: pandas APIs on Apache Spark

    #### Abstract:
    In this talk, Reynold will present Koalas, a new open source project that was announced at the Spark + AI Summit in April. Koalas is a Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.

    Reynold will demonstrate Koalas' new functionality since its initial release, discuss its roadmap, and explain how he envisions Koalas becoming the standard API for large-scale data science.

    #### Speaker Bio
    Reynold Xin is a cofounder and Chief Architect at Databricks. In the open source community, Reynold is known as a top contributor to the Apache Spark project, having designed many of its core user-facing APIs and execution engine features. Reynold received a PhD in Computer Science from UC Berkeley, where he worked on large-scale data processing systems.

    ## Magenta: Exploring the role of machine learning in creativity

    #### Abstract:
    Hanoi will give an overview of Magenta, an open source team at Google led by Doug Eck, exploring the role of machine learning in the creative process. The talk will cover exciting and recent developments in the world of AI-generated art as well as live musical demos!

    #### Speaker Bio:
    Lamtharn (Hanoi) Hantrakul - AI Resident, Google Brain

    Hanoi - like the towers? No. Like the city? Yes! Hanoi's Thai parents fell in love there and nicknamed him after the charming city. Born and raised in Bangkok, Thailand, he is currently an AI Resident with Google Brain. His research focuses on real-time Neural Audio Synthesis: rendering sound directly using deep neural networks.

    Hanoi strives for technologies that are transcultural at heart; diversifying software, hardware and AI to encompass underrepresented cultures such as those from his home region of Thailand and beyond. He is most proud of fidular, a modular fiddle system he designed and engineered that enables components like resonators and strings to be swapped across cultures. The system is currently on display at the Musical Instruments Museum in Phoenix, AZ, and has been recognized internationally by the A’ Design Award and Core77 Design Award.

    Hanoi holds degrees in Applied Physics and Music Composition, both with Distinction from Yale University. In his MSc thesis at Georgia Tech, he developed machine learning models for an ultrasound sensor that enables amputees to perform high-dexterity tasks like playing piano, an impossible feat using today’s sensors on conventional prosthetics.

    He also writes music under the moniker "yaboi hanoi". Find his tunes on Instagram and Spotify!

  • Bay Area Apache Spark Meetup @ Salesforce SF

    Salesforce East

    Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark, Machine Learning, and Delta Lake at scale from Salesforce and Databricks.

    This meetup is hosted and sponsored by Salesforce.

    Agenda:

    6:00 - 6:30 pm: Social Hour with Food & Drinks
    6:30 - 6:35 pm: Introduction & Announcements
    6:35 - 7:15 pm: Tech Talk from Salesforce Einstein
    7:15 - 7:55 pm: Tech Talk from Databricks: Delta Lake
    7:55 - 8:15 pm: Additional Networking, Q&A

    Talk 1 Title: Apache Spark at Salesforce Einstein for Marketing Cloud

    Presenters: Peter Krmpotic and Kexin Xie

    Abstract:

    Salesforce Einstein makes more than 6.5B predictions per day to deliver highly personalized customer experiences on behalf of some of the biggest brands in the world, and its engineering and product teams have extensive experience using Apache Spark in production.

    This talk will cover some of their biggest use cases, their successful transition from Apache Hadoop to Apache Spark, valuable insights about using Spark in production and an architecture review for one of their core capabilities.

    Bio: Peter Krmpotic
    Peter Krmpotic is a director of product management for Salesforce Marketing Cloud Einstein and former product leader at Adobe Experience Cloud, BrightEdge and Boost Media. This career experience has given him a comprehensive view of what brands must develop to provide unified customer experiences at scale, the bedrock of any digitally transformed organization. He works with customers to instill customer-centricity to set them up as AI-first companies and leaders of the technology-based emergence of the fourth industrial revolution.

    In his current role at Salesforce, Peter focuses on democratizing artificial intelligence, specifically deep learning and natural language processing, for the purpose of personalizing customer experiences at scale. He holds an MSc in Computer Science from Karlsruhe Institute of Technology (KIT) in Germany.

    Bio: Kexin Xie

    At Salesforce, Kexin Xie is responsible for researching and designing the core distributed data processing and machine learning architecture for Marketing Cloud Einstein and Salesforce DMP. He leads a team of data science engineers with a strong focus on continuously improving operational aspects such as performance, fault tolerance, scale, automation, and cost.

    Before Salesforce, Kexin worked for Krux, BigCommerce, NICTA, Brandscreen, Freelancer and Microsoft Research building large-scale machine learning, data mining, real-time bidding, intelligent marketing, anti-fraud and anti-money laundering software systems. Kexin also holds a Ph.D. degree in computer science.

    Talk 2 Title: Open Source Reliability for Data Lake with Apache Spark
    Presenter: Michael Armbrust
    Abstract: Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.

    In this talk, we will cover:
    * What data quality problems Delta helps address
    * How to convert your existing application to Delta
    * How the Delta transaction protocol works internally
    * The Delta roadmap for the next few releases
    * How to get involved!

    Bio: Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and the Delta Lake open source project. He received his Ph.D. from UC Berkeley in 2013 and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.

  • Spark + AI Summit Meetup @ Moscone Center West, San Francisco

    Moscone Center

    Join us for an evening of Bay Area Apache Spark Meetup at the Spark + AI Summit (https://databricks.com/sparkaisummit/north-america) featuring tech-talks from Google Inc and Workday Inc on Machine Learning.

    Thanks to #WomenInUnifiedAnalytics & Diversity Team at Databricks for sponsoring this meetup.

    (Note: This meetup is open to everyone. You don’t have to be registered for Spark + AI Summit.)

    Agenda:

    6:00 - 6:30 pm Happy Hour: Mingling & Refreshments
    6:30 - 6:40 pm Opening Remarks (Jules Damji, Databricks)
    6:40 - 7:25 pm Talk-1: Paige Bailey, Google
    7:25 - 8:05 pm Talk-2: Madhura Dudhgaonkar, Workday
    8:05 - 9:00 pm More Mingling & Networking

    Talk 1: Details coming soon...
    Abstract:

    Bio:
    Paige Bailey (https://www.linkedin.com/in/dynamicwebpaige) is a Senior TensorFlow Developer Advocate at Google, based in Mountain View, CA. Prior to joining Google, Paige worked as a senior software engineer in the office of the Azure CTO; as a Cloud Developer Advocate for machine learning at Microsoft; and as a data scientist for Chevron in Houston, TX.

    Paige has over a decade of experience using Python for data analysis and five years of experience doing machine learning, and can't wait to show you the new capabilities in TensorFlow 2.0.

    Talk 2: Machine Learning Products - How do you begin and when do you scale?

    Abstract: So you have heard all the hype around how Machine Learning is going to change the world. But within your business context, where do you start? How do you choose the right use cases to begin with? How do you get leadership buy-in for more investment? And when do you start thinking about scaling your ML Services?

    In this session, you will walk away with an actionable framework to bootstrap and scale an applied machine learning services function. You will see the framework in action through an actual 0 to 1 product journey involving deep learning, where we delivered value in record speed in spite of not having a dataset when we started. You will get practical tips on how to decide when and how to scale your ML services and platform.

    Furthermore, how do you get leadership buy-in for more investment? You will leave with some counter-intuitive tips that we discovered as a result of productizing ML services over the last 5+ years using a diverse range of technologies: Vision, Language, Graph, Anomaly Detection, Search Relevance, Personalization.

    Bio:
    Madhura Dudhgaonkar (https://www.linkedin.com/in/madhurad) is a Machine Learning leader at Workday passionate about modernizing the future of work. Her team, a pioneer in the Enterprise Machine Learning space, has spent 6+ years building ML products leveraging Vision, Natural Language Processing, Recommendations, Anomaly Detection and more.

    Madhura’s career journey goes from being a hands-on engineer to leading large organizations across Sun Microsystems, Adobe and now Workday. Her background covers building both consumer and enterprise products, the latest of them involving multiple 0 to 1 product journeys leveraging Machine Learning.

    She is considered a thought leader in building ML products and is frequently invited to speak at AI conferences.

    Madhura holds a master’s degree in Math and Computer Science. She is also passionate about building diverse teams and leads diversity and inclusion work through a Women at Workday chapter.

  • Bay Area Apache Spark Meetup @ Unravel Data in SF

    Microsoft Reactor

    Let's kick off the New Year 2019 with our first BASM Meetup!

    Join us for an evening of Bay Area Apache Spark Meetup featuring tech-talks about Apache Spark at scale from Unravel Data and Databricks.

    Agenda:

    6:30 - 7:00 pm: Social Hour with Food, Drinks, Beer & Wine
    7:00 - 7:05 pm: Jules Introduction & Announcements
    7:05 - 7:50 pm: Tech Talk from Unravel Data
    8:00 - 8:45 pm: Tech Talk from Databricks
    8:45 - 9:00 pm: Unravel Raffle and Additional Networking, Q&A

    Tech Talk 1: Putting AI to Work on Apache Spark

    Presenter: Shivnath Babu

    Abstract: Apache Spark simplifies AI, but why not use AI to simplify Spark performance and operations management? An AI-driven approach can drastically reduce the time Spark application developers and operations teams spend troubleshooting problems.

    This talk will discuss algorithms that run real-time streaming pipelines as well as build ML models in batch to enable Spark users to automatically solve problems like: (i) fixing a failed Spark application, (ii) auto-tuning SLA-bound Spark streaming pipelines, (iii) identifying the best broadcast joins and caching for Spark SQL queries and tables, (iv) picking cost-effective machine types and container sizes to run Spark workloads on AWS, Azure, and Google Cloud; and more.

    Bio: Shivnath Babu is the CTO and co-founder of Unravel Data Systems and an adjunct professor of computer science at Duke University. Shivnath co-founded Unravel to solve the application management challenges that companies face when they adopt systems like Hadoop and Spark. Unravel originated from the Starfish platform built at Duke, which has been downloaded by over 100 companies. Shivnath has won a US National Science Foundation CAREER Award, three IBM Faculty Awards, and an HP Labs Innovation Research Award.

    Tech Talk 2: Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Training and Inference on Apache Spark

    Presenter: Lu Wang
    Abstract:

    Big data and AI are joined at the hip: the best AI applications require massive amounts of constantly updated training data to build state-of-the-art models. AI has always been one of the most exciting applications of big data. Project Hydrogen is a major Apache Spark initiative to bring the best AI and big data solutions together. It introduced barrier execution mode in the Spark 2.4.0 release to support distributed model training, and it explores optimized data exchange to accelerate distributed model inference.

    In this talk, we will explain why barrier execution mode is needed, how it works, and how to use it to integrate distributed DL training on Spark. We will demonstrate HorovodRunner, the first Spark+AI integration powered by Project Hydrogen. It is built on the Horovod framework developed by Uber and on Databricks Runtime 5.0 for Machine Learning.

    We will also share our experience and performance tips on how to combine Pandas UDFs from Spark with AI frameworks to scale complex model inference workloads.

    Bio: Lu Wang is a software engineer at Databricks. His main research interest is developing high-performance parallel algorithms for scientific computing and applications. He has been actively involved in the development of Project Hydrogen, Spark Deep Learning Pipelines, and Spark MLlib since he joined Databricks. Before Databricks, he worked on parallel multigrid linear solvers on exascale parallel machines for solving the linear systems from reservoir simulations at Lawrence Livermore National Laboratory. He received his Ph.D. from the Pennsylvania State University in 2014.
